Preface


The  two  volumes  of  Readings  in  Mathematical  Psychology,  of  which  this  is  the 
first,  are  designed  as  source  materials  to  accompany  the  three-volume  Handbook  of 
Mathematical  Psychology.  The  Handbook  authors  were  asked  to  suggest  journal 
references  that  they  considered  particularly  important  in  their  fields;  from  these 
suggestions  the  articles  appearing  in  the  Readings  were  selected.  Because  of  space 
limitations  and  our  own  evaluations,  we  took  considerable  liberty  in  the  selection 
process. 

This  volume  focuses  on  two  main  areas  of  psychology:  psychophysics  and  learning. 
Part  I  consists  of  14  papers  on  measurement,  psychophysics,  and  reaction  time, 
and  Part  II  consists  of  21  papers  on  learning  and  related  mathematical  and  statistical 
topics.  These  papers  are  referenced  in  Chapters  1-6  and  8-10  of  the  Handbook. 
Volume  II  of  the  Readings  contains  papers  relevant  to  other  Handbook  chapters. 

Papers that have appeared in hard-cover publications, such as Decision Processes
(Wiley, 1954) and Studies in Mathematical Learning Theory (Stanford, 1959),
were  intentionally  excluded  from  the  present  Readings.  It  is  our  view  that  every 
mathematical  psychologist  should  have  such  books  on  his  bookshelf.  They  are  listed 
after  the  preface  to  Volume  I  of  the  Handbook. 

Of  the  35  papers  reproduced  in  this  volume,  11  are  from  Psychometrika,  10  are 
from  Psychological  Review,  3  from  the  Journal  of  Experimental  Psychology,  3  from  the 
Journal  of  the  Acoustical  Society  of  America,  2  from  the  Pacific  Journal  of  Mathematics, 
and  one  each  from  the  Bulletin  of  Mathematical  Biophysics,  the  Proceedings  of  the 
National  Academy  of  Sciences,  Transactions  of  the  Institute  of  Radio  Engineers,  the 
Journal of Symbolic Logic, the Annals of Mathematical Statistics, and a private
document of the U.S. Air Force. Gratitude is expressed for permissions to reproduce
these  papers  here. 

The  35  papers  represent  the  work  of  30  different  contributors.  It  may  be  of 
interest  to  note  that  17  of  these  are  professional  psychologists,  8  are  mathematicians 
or  statisticians,  3  are  engineers,  and  2  are  philosophers.  One  of  the  papers  was 
published  in  1947,  and  the  others  are  rather  uniformly  spread  over  the  years  1950-1962. 

The compilation of a book of this sort requires a surprising amount of
correspondence. For handling this and other details, the editors wish to thank
Miss Ada Katz.

R. Duncan Luce
Robert R. Bush
Eugene Galanter

Philadelphia, Pennsylvania
March, 1963
Part I

MEASUREMENT,  PSYCHOPHYSICS, 
AND  REACTION  TIME 


AN  AXIOMATIC  FORMULATION  AND  GENERALIZATION 
OF  SUCCESSIVE  INTERVALS  SCALING* 

Ernest  Adams 
University of California, Berkeley

AND 

Samuel  Messick 
Educational Testing Service

A  formal  set  of  axioms  is  presented  for  the  method  of  successive  intervals, 
and  directly  testable  consequences  of  the  scaling  assumptions  are  derived. 
Then by a systematic modification of basic axioms the scaling model is
generalized to non-normal stimulus distributions of both specified and unspecified
form. 

Thurstone's  scaling  models  of  successive  intervals  [7,  21]  and  paired 
comparisons  [17,  24]  have  been  severely  criticized  because  of  their  dependence 
upon an apparently untestable assumption of normality. This objection
was  recently  summarized  by  Stevens  [22],  who  insisted  that  the  procedure 
of  using  the  variability  of  a  psychological  measure  to  equalize  scale  units 
"smacks  of  a  kind  of  magic — a  rope  trick  for  climbing  the  hierarchy  of  scales. 
The  rope  in  this  case  is  the  assumption  that  in  the  sample  of  individuals 
tested the trait in question has a canonical distribution, (e.g., 'normal')
. . . . There are those who believe that the psychologists who make
assumptions whose validity is beyond test are hoist with their own petard . . . ."
Luce  [13]  has  also  viewed  these  models  as  part  of  an  "extensive  and  unsightly 
literature  which  has  been  largely  ignored  by  outsiders,  who  have  correctly 
condemned  the  ad  hoc  nature  of  the  assumptions." 

Gulliksen  [11],  on  the  other  hand,  has  explicitly  discussed  the  testability 
of  these  models  and  has  suggested  alternative  procedures  for  handling  data 
which  do  not  satisfy  the  checks.  Empirical  tests  of  the  scaling  theory  were 
also  mentioned  or  implied  in  several  other  accounts  of  the  methods  [e.g., 
8,  9,  12,  15,  21,  25].  Criteria  of  goodness  of  fit  have  been  presented  [8,  18], 
which,  if  met  by  the  data,  would  indicate  satisfactory  scaling  within  an 
acceptable error. Random errors and sampling fluctuations, as well as
systematic deviations from scaling assumptions, are thereby evaluated by these
over-all internal consistency checks. However, tests of the scaling assumptions,
and in particular the normality hypothesis, have not yet been explicitly
derived in terms of the necessary and sufficient conditions required to satisfy
the model. Recently Rozeboom and Jones [20] and Mosteller [16] have
investigated the sensitivity of successive intervals and paired comparisons,
respectively, to a normality requirement, indicating that departures from
normality in the data are not too disruptive of scale values with respect to
goodness of fit, but direct empirical consequences of the assumptions of the
model were not specified as such.

*This paper was written while the authors were attending the 1957 Social Science
Research Council Summer Institute on Applications of Mathematics in Social Science.
The research was supported in part by Stanford University under Contract NR 171-034
with Group Psychology Branch, Office of Naval Research, by Social Science Research
Council, and by Educational Testing Service. The authors wish to thank Dr. Patrick Suppes
for his interest and encouragement throughout the writing of the report and Dr. Harold
Gulliksen for his helpful and instructive comments on the manuscript.

This article appeared in Psychometrika, 1958, 23, 355-368. Reprinted with permission.

The  present  axiomatic  characterization  of  a  well-established  scaling  model 
was  attempted  because  of  certain  advantages  which  might  accrue:  (a)  an 
ease of generalization that follows from a precise knowledge of formal
properties by systematically modifying axioms, and (b) an ease in making
comparisons between the properties of different models. The next section deals
with the axioms for successive intervals and serves as the basis for the ensuing
section, in which the model is generalized to non-normal stimulus
distributions. One outcome of the following formalization which should again be
highlighted is that the assumption of normality has directly verifiable
consequences and should not be characterized as an untestable supposition.

Thurstone's  Successive  Intervals  Scaling  Model 
The  Experimental  Method 

In  the  method  of  successive  intervals  subjects  are  presented  with  a 
set  of  n  stimuli  and  asked  to  sort  them  into  k  ordered  categories  with  respect 
to some attribute. The proportion of times $f_{s,i}$ that a given stimulus s is
placed in category i is determined from the responses. If it is assumed that a
category actually represents a certain interval of stimulus values for a subject,
then the relative frequency with which a given stimulus is placed in a
particular category should represent the probability that the subject estimates
the stimulus value to lie within the interval corresponding to the category.
This  probability  is  in  turn  simply  the  area  under  the  distribution  curve  inside 
the  interval.  So  far  scale  values  for  the  end  points  of  the  intervals  are  unknown, 
but  if  the  observed  probabilities  for  a  given  stimulus  are  taken  to  represent 
areas  under  a  normal  curve,  then  scale  values  may  be  obtained  for  both  the 
category  boundaries  and  the  stimulus. 

Scale  values  for  interval  boundaries  are  determined  by  this  model, 
and  interval  widths  are  not  assumed  equal,  as  in  the  method  of  equal  appearing 
intervals.  Essentially  equivalent  procedures  for  obtaining  successive  intervals 
scale values have been presented by Saffir [21], Guilford [10], Mosier [15],
Bishop  [3],  Attneave  [2],  Garner  and  Hake  [9],  Edwards  [7],  Burros  [5],  and 
Rimoldi  [19].  The  basic  rationale  of  the  method  had  been  previously  outlined 
by  Thurstone  in  his  absolute  scaling  of  educational  tests  [23,  26].  Gulliksen 
[12], Diederich, Messick, and Tucker [6], and Bock [4] have described least
square  solutions  for  successive  intervals,  and  Rozeboom  and  Jones  [20] 
presented  a  derivation  for  scale  values  which  utilized  weights  to  minimize 
sampling  errors.  Most  of  these  papers  contain  the  notion  that  the  assumption 
of  normality  can  be  checked  by  considering  more  than  one  stimulus.  Although 
one  distribution  of  relative  frequencies  can  always  be  converted  to  a  normal 
curve,  it  is  by  no  means  always  possible  to  normalize  simultaneously  all 
of  the  stimulus  distributions,  allowing  unequal  means  and  variances,  on 
the  same  base  line.  The  specification  of  exact  conditions  under  which  this  is 
possible  will  now  be  attempted.  In  all  that  follows,  the  problem  of  sampling 
fluctuations  is  largely  ignored,  and  the  model  is  presented  for  the  errorless 
case. 

The  Formal  Model 

The set of stimuli, denoted S, has elements r, s, u, v, . . . . There is no
limit upon the admissible number of stimuli, although for the purpose of
testing the model, S must have at least two members. For each stimulus
s in S, and each category $i = 1, 2, \ldots, k$, the relative frequency $f_{s,i}$ with
which stimulus s is placed in category i is given. Formally f is a function
from the Cartesian product $S \times \{1, 2, \ldots, k\}$ to the real numbers. More
specifically, it will be the case that for each s in S, $f_s$ will be a probability
distribution over the set $\{1, 2, \ldots, k\}$. For the sake of an explicit statement
of the assumptions of the model, this fact will appear as an axiom, although
it must be satisfied by virtue of the method of determining the values of $f_{s,i}$.

Axiom 1. f is a function mapping $S \times \{1, \ldots, k\}$ into the real numbers
such that for each s in S, $f_s$ is a probability distribution over $\{1, \ldots, k\}$; i.e.,
for each s in S and $i = 1, \ldots, k$, $0 \le f_{s,i} \le 1$ and $\sum_{i=1}^{k} f_{s,i} = 1$.

The set S and the function f constitute the observables of the model.
Two more concepts which are not directly observed remain to be introduced.
The first of these is a set of numbers $t_1, \ldots, t_{k-1}$, which are the end points
of the intervals corresponding to the categories. It is assumed that these
intervals are adjacent and that they cover the entire real line. Formally,
it will simply be assumed that $t_1, \ldots, t_{k-1}$ are an increasing series of real
numbers.

Axiom 2. Interval boundaries $t_1, \ldots, t_{k-1}$ are real numbers, and for
$i = 2, \ldots, k - 1$, $t_{i-1} < t_i$.

Finally, the distribution corresponding to each stimulus s in S is
represented by a normal distribution function $N_s$.

Axiom 3. N is a function mapping S into normal distribution functions
over the real line.

Axioms 1-3 do not state fully the mathematical properties required for
the set S, the numbers $t_1, \ldots, t_{k-1}$, and the functions $N_s$. In the interests
of completeness, these will be stated in the following Axiom 0, which for
formal purposes should be referred to instead of Axioms 1-3.

Axiom 0. S is a non-empty set. k is a positive integer. f is a function
mapping $S \times \{1, \ldots, k\}$ into the closed interval [0, 1], such that for each s
in S, $\sum_{i=1}^{k} f_{s,i} = 1$. For $i = 1, \ldots, k - 1$, $t_i$ is a real number, and for
$i = 1, \ldots, k - 2$, $t_i < t_{i+1}$. N is a function mapping S into the set of
normal distribution functions over the real numbers.

Axioms  2  and  3  state  only  the  set-theoretical  character  of  the  elements 
$t_i$ and $N_s$, and have no intuitive empirical content. The central hypothesis
of the theory states the connection between the observed relative frequencies
$f_{s,i}$ and the assumed underlying distributions $N_s$.

Axiom 4. (Fundamental hypothesis) For each s in S and $i = 1, \ldots, k$,

     $f_{s,i} = \int_{t_{i-1}}^{t_i} N_s(\alpha)\, d\alpha.$

(Note that if $i = 1$, $t_{i-1}$ is set equal to $-\infty$, and if $i = k$, $t_i = \infty$.)
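
As a minimal sketch of Axiom 4, the following computes the category proportions implied by one normal stimulus distribution. The function name `category_probabilities`, the boundary values, and the stimulus mean and standard deviation are illustrative assumptions and do not come from the paper; the normal cumulative distribution is taken from Python's standard-library `statistics.NormalDist`.

```python
from statistics import NormalDist

def category_probabilities(mean, sd, boundaries):
    """Return f_{s,1..k} for a normal stimulus distribution, per Axiom 4.

    boundaries is the increasing list t_1 < ... < t_{k-1}; the outer
    end points t_0 = -inf and t_k = +inf are implied.
    """
    dist = NormalDist(mu=mean, sigma=sd)
    # Cumulative areas up to each boundary, padded with 0 and 1 for the tails.
    cum = [0.0] + [dist.cdf(t) for t in boundaries] + [1.0]
    return [cum[i + 1] - cum[i] for i in range(len(cum) - 1)]

boundaries = [-1.0, 0.0, 1.5]   # hypothetical t_1, t_2, t_3, so k = 4
f_s = category_probabilities(mean=0.3, sd=1.2, boundaries=boundaries)
print(f_s)
```

The returned list is automatically a probability distribution over the k categories, which is the content of Axiom 1.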

Axioms 1-4 state the formal assumptions of the theory although, because
the fundamental hypothesis (Axiom 4) involves the unobservables $N_s$ and
$t_i$, it is not directly testable in these terms. The question of testing the model
will be discussed in the next section. Scale values for the stimuli have not
yet been introduced. These are defined to be equal to the means of the
distributions $N_s$, and hence are easily derived. The function v will represent the
scale values of the stimuli.

Definition 1. v is the function mapping S into the real numbers such
that for each s in S, $v_s$ is the mean of $N_s$; i.e.,

     $v_s = \int_{-\infty}^{\infty} \alpha N_s(\alpha)\, d\alpha.$


Testing  the  Model 

The model will be said to fit exactly if all of the testable consequences
of Axioms 1-4 are verified. Testable consequences of these axioms will be
those consequences which are formulated solely in terms of the observable
concepts S and f, or of concepts which are definable in terms of S and f.
If no further assumptions are made about an independent determination of
$t_1, \ldots, t_{k-1}$ and N, then the testable consequences are just those which
follow about f and S from the assumption that there exist numbers
$t_1, \ldots, t_{k-1}$ and functions $N_s$ which satisfy Axioms 1-4. In this model,
it is possible to give an exhaustive description of the testable consequences;
hence this theory is axiomatizable in the sense that it is possible to formulate
observable conditions which are necessary and sufficient to insure the existence
of the numbers $t_i$ and functions $N_s$. The derivation of these conditions will
proceed by stages.

Let $p_{s,i}$ be the cumulative distribution of the function f for stimulus s
and interval i.

Definition 2. For each s in S and $i = 1, \ldots, k$,

     $p_{s,i} = \sum_{j=1}^{i} f_{s,j}.$

It follows from this definition and Axiom 4 that for each s in S and
$i = 1, \ldots, k$,

(1)  $p_{s,i} = \int_{-\infty}^{t_i} N_s(\alpha)\, d\alpha.$

Using the table for the cumulative distribution of the normal curve with
zero mean and unit variance, the numbers $z_{s,i}$ may be determined such that

(2)  $p_{s,i} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z_{s,i}} e^{-x^2/2}\, dx.$

(Note that for $i = k$, $z_{s,i}$ will be infinite.) $N_s$ is a normal distribution function
and must have the form:


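(3)  $N_s(\alpha) = \frac{1}{\sigma_s \sqrt{2\pi}} \exp\left[-\frac{(\alpha - v_s)^2}{2\sigma_s^2}\right],$

where $\sigma_s^2$ is the variance of $N_s$ about its mean $v_s$. Equations (1), (2), and (3)
yield the conclusion that for each s in S and $i = 1, \ldots, k$,

(4)  $z_{s,i} = (t_i - v_s)/\sigma_s.$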
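
The passage from observed proportions to the deviates of equation (4) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function name `normal_deviates` and all numerical values are hypothetical, and `NormalDist().inv_cdf` from the Python standard library stands in for the printed normal table.

```python
from itertools import accumulate
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero mean, unit variance

def normal_deviates(freqs):
    """Convert f_{s,1..k} into z_{s,1..k-1} using Definition 2 and equation (2)."""
    cumulative = list(accumulate(freqs))          # p_{s,i}, i = 1..k
    # p_{s,k} = 1 gives an infinite deviate, so the last value is dropped.
    # inv_cdf raises StatisticsError if any remaining p is exactly 0 or 1.
    return [STD_NORMAL.inv_cdf(p) for p in cumulative[:-1]]

# Hypothetical errorless data: build f from known t_i, v_s, sigma_s (Axiom 4).
t = [-1.0, 0.0, 1.5]
v_s, sigma_s = 0.3, 1.2
stim = NormalDist(v_s, sigma_s)
cum = [0.0] + [stim.cdf(x) for x in t] + [1.0]
f_s = [cum[i + 1] - cum[i] for i in range(len(cum) - 1)]

z_s = normal_deviates(f_s)
expected = [(x - v_s) / sigma_s for x in t]       # right side of equation (4)
print(z_s)
print(expected)
```

For errorless data the two printed lists should agree to within floating-point rounding, which is what equation (4) asserts.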

In (4) the numbers $z_{s,i}$ on the left are known transformations of the
observed proportions $f_{s,i}$, while the numbers $t_i$, $v_s$ and $\sigma_s$ are unknown.
Suppose however that r is a fixed member of the class S of stimuli; it is
possible to solve (4) for all the unknowns in terms of the known z's, and
$v_r$ and $\sigma_r$, the mean and standard deviation of the fixed stimulus r. These
solutions are

(5)  $t_i = \sigma_r z_{r,i} + v_r$   for   $i = 1, \ldots, k - 1$;

(6)  $\sigma_s = \sigma_r \frac{z_{r,i} - z_{r,j}}{z_{s,i} - z_{s,j}}$   for   $s \in S$,   and   $i \neq j$;
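
Equation (7) is cited in the text that follows but is not legible in this copy. As a hedged reconstruction, and not necessarily the form the authors printed, an expression for $v_s$ follows from (4), (5), and (6) by solving (4) for $v_s$ and substituting, with $i$ and $j$ restricted to $1, \ldots, k - 1$:

```latex
\begin{align*}
v_s &= t_i - \sigma_s z_{s,i}
      && \text{from (4)} \\
    &= v_r + \sigma_r z_{r,i}
       - \sigma_r \frac{z_{r,i} - z_{r,j}}{z_{s,i} - z_{s,j}}\, z_{s,i}
      && \text{by (5) and (6)} \\
    &= v_r + \sigma_r \frac{z_{s,i} z_{r,j} - z_{r,i} z_{s,j}}{z_{s,i} - z_{s,j}},
      \qquad i \neq j .
\end{align*}
```

The last line depends only on the known z's and on the arbitrary constants $v_r$ and $\sigma_r$, as the surrounding text requires of the solutions.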

The necessary and sufficient condition that the system of equations (4)
have a solution, and hence that $t_i$, $v_s$ and $\sigma_s$ be determinable using (5), (6),
and (7), is that all $z_{s,i}$ be linear functions of each other in the following sense.
For all r and s in S, there exist real numbers $a_{r,s}$ and $b_{r,s}$ such that for each
$i = 1, \ldots, k$,

(8)  $z_{s,i} = a_{r,s} z_{r,i} + b_{r,s}.$

The required numbers $a_{r,s}$ and $b_{r,s}$ exist if and only if for each r and s, the ratio

(9)  $\frac{z_{r,i} - z_{r,j}}{z_{s,i} - z_{s,j}} = a_{s,r} = \frac{1}{a_{r,s}}$

is independent of i and j.

If constants $a_{r,s}$ and $b_{r,s}$ satisfying (8) exist, then they are related to
the scale values $v_s$ and the standard deviations $\sigma_s$ in a simple way. For each
r, s in S,

(10)  $a_{r,s} = \sigma_r/\sigma_s,$

and

(11)  $b_{r,s} = (v_r - v_s)/\sigma_s.$

Clearly the arbitrary choice of the constants $v_r$ and $\sigma_r$ in (5), (6), and (7)
represents the arbitrary choice of origin and unit in the scale. Since scale
values of $t_i$ and $v_s$ are uniquely determined once $v_r$ and $\sigma_r$ are chosen, the
scale  values  are  unique  up  to  a  linear  transformation;  i.e.,  an  interval  scale 
of  measurement  has  been  determined.  It  should  be  noted  that  this  model 
does  not  require  equality  of  standard  deviations  (or  what  Thurstone  has 
called  discriminal  dispersions  [25])  but  provides  for  their  determination 
from  the  data  by  equation  (6).  This  adds  powerful  flexibility  in  its  possible 
applications. 
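
A minimal sketch of the solution for errorless data is given below. The function name `scale_from_deviates`, its default origin and unit, and the "true" values used to manufacture the deviates are all hypothetical. The scale value $v_s$ is obtained here by solving (4) directly, rather than from a printed formula.

```python
def scale_from_deviates(z_r, z_s, v_r=0.0, sigma_r=1.0, i=0, j=1):
    """Solve equation (4) for t_i, sigma_s, v_s given a reference stimulus r.

    z_r, z_s hold the finite deviates z_{.,1..k-1}; v_r and sigma_r fix the
    arbitrary origin and unit. Indices i, j (i != j) pick the pair used in (6).
    """
    t = [sigma_r * z + v_r for z in z_r]                        # equation (5)
    sigma_s = sigma_r * (z_r[i] - z_r[j]) / (z_s[i] - z_s[j])   # equation (6)
    v_s = t[i] - sigma_s * z_s[i]                               # equation (4) solved for v_s
    return t, sigma_s, v_s

# Hypothetical "true" values, used only to manufacture errorless deviates.
true_t = [-1.0, 0.0, 1.5]
true_r = (0.3, 1.2)    # (v_r, sigma_r)
true_s = (-0.5, 0.8)   # (v_s, sigma_s)
z_r = [(x - true_r[0]) / true_r[1] for x in true_t]
z_s = [(x - true_s[0]) / true_s[1] for x in true_t]

t_hat, sigma_s_hat, v_s_hat = scale_from_deviates(z_r, z_s)
print(t_hat, sigma_s_hat, v_s_hat)
```

With the default origin and unit the recovered values differ from the manufactured ones by a single linear transformation, which illustrates the interval-scale property; passing `v_r=0.3, sigma_r=1.2` reproduces the manufactured values themselves.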

It  remains  only  to  make  a  remark  about  the  necessary  and  sufficient 
condition which a set of observed relative frequencies $f_{s,i}$ must fulfill in
order to satisfy the model. This necessary and sufficient condition is simply
that the numbers $z_{s,i}$, which are defined in terms of the observed relative
frequencies, be linearly related as expressed in (8). This can be determined
by seeing if the ratios computed from (9) are independent of i and j, or by
evaluating for all s, r the linearity of the plots of $z_{s,i}$ against $z_{r,i}$. Hence
for  this  model  there  is  a  simple  decision  procedure  for  determining  whether 
or  not  a  given  set  of  errorless  data  fits. 
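
The decision procedure for errorless data can be sketched as a check that the ratios in (9) are constant. The function names, the tolerance `tol`, and the example stimuli are assumptions made for illustration; the third stimulus is deliberately generated from a logistic rather than a normal distribution so that the check has something to reject.

```python
from itertools import combinations
from math import exp
from statistics import NormalDist

def ratios(z_r, z_s):
    """All ratios (z_{r,i} - z_{r,j}) / (z_{s,i} - z_{s,j}) from equation (9)."""
    pairs = combinations(range(len(z_r)), 2)
    return [(z_r[i] - z_r[j]) / (z_s[i] - z_s[j]) for i, j in pairs]

def fits_exactly(z_r, z_s, tol=1e-9):
    """True when the ratios in (9) are independent of i and j, up to tol."""
    r = ratios(z_r, z_s)
    return max(r) - min(r) <= tol

t = [-3.0, -1.0, 0.0, 1.0, 3.0]               # hypothetical boundaries
z_r = [(x - 0.0) / 1.0 for x in t]            # normal stimulus r
z_s = [(x - 0.5) / 2.0 for x in t]            # normal stimulus s
# A stimulus whose true distribution is logistic, forced through the normal table:
inv = NormalDist().inv_cdf
z_u = [inv(1.0 / (1.0 + exp(-x))) for x in t]

print(fits_exactly(z_r, z_s))   # normal pair
print(fits_exactly(z_r, z_u))   # normal versus logistic
```

The sketch assumes finite, distinct deviates for each stimulus, so that no denominator in (9) vanishes; at least three boundaries are needed for the check to be informative, since two boundaries yield only one ratio.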

If $z_{s,i}$ and $z_{r,i}$ are found to be linearly related for all s, r in S, the
assumptions of the scaling model are verified for that data. If the z's are not linearly
related,  then  assumptions  have  been  violated.  For  example,  the  normal  curve 
may  not  be  an  appropriate  distribution  function  for  the  stimuli  and  some  other 
function  might  yield  a  better  fit  [cf.  11,  12].  Or  perhaps  the  responses  cannot 
be  summarized  unidimensionally  in  terms  of  projections  on  the  real  line 
representing  the  attribute  [11].  If  the  stimuli  are  actually  distributed  in  a 
multidimensional space, then judgments of projections on one of the attributes
may be differentially distorted by the presence of variations in other
dimensions. This does not mean that stimuli varying in several dimensions may
not  be  scaled  satisfactorily  by  the  method  of  successive  intervals,  but  rather 
that  if  the  model  does  not  fit,  such  distortion  effects  might  be  operating. 
A  multidimensional  scaling  model  [14]  might  prove  more  appropriate  in 
such  cases. 

In practice the set of points $(z_{r,i}, z_{s,i})$ for $i = 2, \ldots, k - 1$ will
never  exactly  fit  the  straight  line  of  (8)  but  will  fluctuate  about  it.  It  remains 
to  be  decided  whether  this  fluctuation  represents  systematic  departure 
from  the  model  or  error  variance.  In  the  absence  of  a  statistical  test  for 
linearity,  the  decision  is  not  precise,  although  the  linearity  of  the  plots  may 
still  be  evaluated,  even  if  only  by  eye.  One  approach  is  to  fit  the  obtained 
points  to  a  straight  line  by  the  method  of  least  squares  and  then  evaluate 
the  size  of  the  obtained  minimum  error  [4,  6,  12].  In  any  event,  the  test  of 
the  model  is  exact  in  the  errorless  case,  and  the  incorporation  of  a  suitable 
sampling  theory  would  provide  decision  criteria  for  direct  experimental 
applications. 
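
One way to carry out the least-squares evaluation mentioned above is sketched here. It is an ordinary regression of $z_{s,i}$ on $z_{r,i}$ that treats the reference deviates as error-free, which is a simplifying assumption and may differ from the solutions in the cited references; the function name `fit_line` and the perturbed data are made up for illustration.

```python
from math import sqrt

def fit_line(z_r, z_s):
    """Least-squares fit of z_s = a * z_r + b; returns (a, b, rms_residual)."""
    n = len(z_r)
    mean_r = sum(z_r) / n
    mean_s = sum(z_s) / n
    sxx = sum((x - mean_r) ** 2 for x in z_r)
    sxy = sum((x - mean_r) * (y - mean_s) for x, y in zip(z_r, z_s))
    a = sxy / sxx                      # estimate of a_{r,s} in equation (8)
    b = mean_s - a * mean_r            # estimate of b_{r,s} in equation (8)
    rms = sqrt(sum((y - (a * x + b)) ** 2 for x, y in zip(z_r, z_s)) / n)
    return a, b, rms

# Hypothetical deviates: an exact line plus small made-up perturbations.
z_r = [-1.2, -0.4, 0.3, 1.1]
noise = [0.03, -0.02, -0.04, 0.03]
z_s = [1.5 * x - 0.25 + e for x, e in zip(z_r, noise)]

a_hat, b_hat, rms = fit_line(z_r, z_s)
print(a_hat, b_hat, rms)
```

The root-mean-square residual is one candidate for "the size of the obtained minimum error"; deciding how large a residual signals systematic departure still requires the sampling theory the text calls for.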

A  Generalization  of  the  Successive  Intervals  Model 

The successive intervals model discussed in the previous section can
be generalized in a number of ways. One generalization, treated in detail
by Torgerson [27], considers each interval boundary $t_i$ to be the mean of a
subjective distribution with positive variance. Another approach toward
generalizing the model is to weaken the requirement of normal distributions
of stimulus scale values. Formally, this generalization amounts to enlarging
the class of admissible distribution functions. Instead of specifying exactly
which distribution functions are allowed in the generalization, assume an
arbitrary set $\psi$ of distributions over the real line, to which it is required that
the stimulus distributions belong. In formalizing the model, $\psi$ is characterized
simply as a set of distribution functions over the real line. Axiom 3 may be
replaced by a new axiom specifying the nature of the class $\psi$ and stating
that C is a function mapping S into elements of $\psi$; i.e., for each s in S, $C_s$
(interpreted as the distribution of the stimulus s) is a member of $\psi$.

One final assumption about the class $\psi$ needs to be added: namely,
if $\psi$ contains a distribution function C, then it must contain all linear
transformations of C. A linear transformation of a distribution function C is
defined as any other distribution function C' which can be obtained from
C by a shift of origin and a scale transformation of the horizontal axis. A
stretch along the horizontal axis must be compensated for by a contraction
on the vertical axis in order that the transformed function also be a probability
density function. Algebraically, these transformations have the following
form. Let D and D' be distribution functions, then D' is a linear transformation
of D if there exists a positive real number a and a real number b such that
for all x,

     $D'(x) = a D(ax + b).$

This  is  not  truly  a  linear  transformation  because  of  multiplication  by  a  on  the 
ordinate,  but  for  lack  of  a  better  term  this  phrase  is  used.  The  reason  for 
requiring that the class $\psi$ of distribution functions be closed under linear
transformations is to insure that in any determination of stimulus scale
values it will be possible to convert them by a linear transformation into
another admissible set of scale values; i.e., the stimulus values obtained are
to form an interval scale. If the set $\psi$ is not closed under linear transformations,
in general it will not be possible to alter the scale by an arbitrary linear
transformation.

Axiom 3'. $\psi$ is a set of distribution functions over the real numbers, and
C is a function mapping S into $\psi$. For all D in $\psi$, if a is a positive real number
and b is a real number, then the function D' such that for all x,

     $D'(x) = a D(ax + b)$

is a member of $\psi$.

It  is  to  be  observed  that  the  set  of  normal  distributions  has  the  required 
property  of  being  closed  under  linear  transformations.  This  set  is  in  fact  a 
minimal class of this type, in the sense that all normal distribution functions
can be generated from a single normal distribution function by linear
transformations.
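
The two claims just made can be checked in a few lines. This is offered as a sketch; the symbol $\varphi$ for the normal density with zero mean and unit variance and the substitution $y = ax + b$ are notation introduced here, not the authors'.

```latex
\begin{align*}
\int_{-\infty}^{\infty} a\,D(ax + b)\,dx
  &= \int_{-\infty}^{\infty} D(y)\,dy = 1
  && (y = ax + b,\ a > 0), \\
a\,\varphi(ax + b)
  &= \frac{a}{\sqrt{2\pi}} \exp\!\left[-\frac{(ax + b)^2}{2}\right]
   = \frac{1}{(1/a)\sqrt{2\pi}}
     \exp\!\left[-\frac{\bigl(x - (-b/a)\bigr)^2}{2\,(1/a)^2}\right].
\end{align*}
```

The first line shows that the transformed function is again a probability density. The second shows that $a\,\varphi(ax + b)$ is the normal density with mean $-b/a$ and standard deviation $1/a$, so every normal density arises from $\varphi$ by a suitable choice of $a > 0$ and $b$.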

Finally,  Axiom  4  is  replaced  by  an  obvious  generalization  which  specifies 
the connection between the observed $f_{s,i}$, the distribution functions $C_s$,
and the interval end points $t_i$.


Axiom 4'. For each s in S and $i = 1, \ldots, k$,

     $f_{s,i} = \int_{t_{i-1}}^{t_i} C_s(x)\, dx.$

(Here again $t_0 = -\infty$ and $t_k = \infty$.) The stimulus values are defined as
before to be the means of the distribution functions $C_s$.

Definition 1'. v is the function mapping S into the real numbers such
that for each s in S, $v_s$ is the mean of $C_s$, i.e.,

     $v_s = \int_{-\infty}^{\infty} x C_s(x)\, dx.$


The problem now is to specify the class of admissible distribution
functions $\psi$. Each specification of this class amounts to a theory about the
underlying stimulus distributions. If the hypothesis of normality is altered or
weakened, what assumptions can replace it? Omitting any assumption about
the form of the distribution functions would amount to letting $\psi$ be the set
of all distribution functions over real numbers. If no assumption whatever
is made about the forms of $C_s$, then the theory is very weak. Every set of
data will fit the theory, and the scale values of $t_i$ can be determined only
on an ordinal scale. It is always possible to determine distribution functions
$C_s$ satisfying Axiom 4' for arbitrarily specified $t_i$. To show this it is only
necessary to construct them in accordance with the following definition.


     $C_s(x) = \begin{cases} f_{s,i}, & i - 1 < x \le i, \quad i = 1, \ldots, k, \\ 0 & \text{otherwise.} \end{cases}$
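
The claim that every set of data fits when the class of distributions is unrestricted can be illustrated numerically. The sketch below uses a construction chosen for this example and is not the authors' definition: the function names, the proportions, the boundaries, and the decision to place the two open-ended categories on unit-width intervals are all assumptions.

```python
def step_density(freqs, boundaries):
    """Build a piecewise-constant density reproducing freqs for given boundaries.

    Returns a list of (low, high, height) pieces. Interior categories get a
    uniform density on (t_{i-1}, t_i]; the two open-ended categories are
    placed, as an arbitrary choice, on unit-width intervals next to
    t_1 and t_{k-1}.
    """
    edges = [boundaries[0] - 1.0] + list(boundaries) + [boundaries[-1] + 1.0]
    return [(lo, hi, f / (hi - lo)) for lo, hi, f in zip(edges, edges[1:], freqs)]

def area(pieces, lo, hi):
    """Integrate the step density over the interval from lo to hi."""
    total = 0.0
    for p_lo, p_hi, height in pieces:
        overlap = min(hi, p_hi) - max(lo, p_lo)
        if overlap > 0:
            total += height * overlap
    return total

freqs = [0.10, 0.45, 0.05, 0.40]      # made-up proportions summing to 1
boundaries = [-2.0, 0.5, 0.7]         # made-up increasing t_1 < t_2 < t_3
pieces = step_density(freqs, boundaries)

inf = float("inf")
cuts = [-inf] + boundaries + [inf]    # t_0 = -inf and t_k = +inf
recovered = [area(pieces, a, b) for a, b in zip(cuts, cuts[1:])]
print(recovered)
```

Whatever increasing boundaries are supplied, the areas between them reproduce the given proportions, so Axiom 4' places no restriction on the data and fixes the boundaries only up to order.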


Non-normal Distributions of Specified Form

It is clearly necessary to make some restrictions on $\psi$ if the scale values
are to be determined uniquely up to a linear transformation. It will next
be shown that any minimal class of distribution functions, in the sense of
a class all of whose members are generated from a single member by linear
transformations, has the desired property of generating a linear scale of
stimulus values when the model fits. For the present assume that $\psi$ is a
minimal class of distribution functions.

Assumption 1. There exists a distribution function D such that for all
distribution functions D' in $\psi$ there exists a positive real number a and a
real number b such that for all x,

     $D'(x) = a D(ax + b).$

To show that if Assumption 1 is satisfied the scale values are obtained
on an interval scale, we proceed as follows. Axiom 3' and Assumption 1 imply
that for all s in S, there exists a positive real number $a_s$ and a real number
$b_s$ such that for all x,

(12)  $C_s(x) = a_s D(a_s x + b_s),$

where the function D on the right side of (12) is a fixed function of some
specified form linearly related to all the functions D' in $\psi$. According to
Axiom 4', then, for each s in S, and $i = 1, \ldots, k$,

(13)  $f_{s,i} = \int_{t_{i-1}}^{t_i} a_s D(a_s x + b_s)\, dx.$

If $\pi$ is the cumulative distribution corresponding to D, and the cumulative
distributions $p_{s,i}$ are defined as before, then

(14)  $p_{s,i} = \int_{-\infty}^{t_i} a_s D(a_s x + b_s)\, dx = \pi(a_s t_i + b_s).$
Assuming that the function $\pi$ is strictly monotone increasing, then, knowing
the form of function D, it is possible to determine uniquely the numbers
$z_{s,i}$ such that for each s in S and $i = 1, \ldots, k$,

(15)  $p_{s,i} = \pi(z_{s,i}).$

Equations (14) and (15) imply immediately that

(16)  $z_{s,i} = a_s t_i + b_s$

for all s in S and $i = 1, \ldots, k$. It is clear from (15) why it is necessary
to assume that $\pi$ is strictly monotone increasing. If it were not, there would
not in general be a unique $z_{s,i}$ determined by (15); hence the scale values
based on $z_{s,i}$ would not be unique. It is also seen that (4), relating $z_{s,i}$ to
$t_i$, $v_s$ and $\sigma_s$ in the normal distribution model, is simply a particular case of
(16) here. The connection between $a_s$, $b_s$ and $\sigma_s$ and $v_s$ is

     $\sigma_s = 1/a_s, \qquad v_s = -b_s/a_s.$

In  (15),  as  in  the  corresponding  set  of  equations  obtained  from  the 
normality  assumption,  the  numbers  on  the  left  are  known,  and  the  numbers 
on the right are unknown. As before, if two numbers $a_r$ and $b_r$ are arbitrarily
determined for a fixed stimulus r, then the $t_i$ are uniquely determined by the
following equation.

(17)  $t_i = (z_{r,i} - b_r)/a_r, \qquad i = 1, \ldots, k.$

The  scale  values  for  the  stimuli,  however,  cannot  be  directly  determined 
from the coefficients $z_{s,i}$, $a_s$ and $b_s$ without first specifying the mean m
of the basic distribution D. If m is the mean of D, then $v_s$, which was defined
as the mean of $C_s$, is determined by

(18)  $v_s = (m - b_s)/a_s.$

Both the $a_s$ and the $b_s$ in (18) can be determined in terms of $z_{s,i}$, $a_r$ and $b_r$
by (19) and (20); hence $v_s$ is immediately determinable in terms of just these
quantities by (18).

(19)  $a_s = a_r \frac{z_{s,i} - z_{s,j}}{z_{r,i} - z_{r,j}},$

(20)  $b_s = z_{s,i} - \frac{z_{s,i} - z_{s,j}}{z_{r,i} - z_{r,j}} (z_{r,i} - b_r).$

It is clear then that the scale values $t_i$ and $v_s$ are determined up to a
linear transformation. Furthermore, necessary and sufficient conditions
that a set of data fit the model are simply that the ratios of differences in
z's on the right in (19) be independent of i and j; i.e., that the z's be linearly
related.
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The  Forms  of  the  Distributions  Unspecified 

A final generalization to be considered is one in which Assumption 1
holds, but where the form of the generating function D is not specified;
i.e., it is assumed that the underlying distributions all belong to one minimal
class, but that the class can be generated by any distribution function D.
Interestingly enough, in this case it is still possible to test the model and
to obtain more than ordinal information about the scale values. If it is
assumed that the stimulus distributions all belong to one minimal family
generated by a function D, but D is unknown, all of the deductions up through
(14) go through, although in this case the function $\pi$ is also unknown. Now,
of course, it is impossible to discover the numbers $z_{s,i}$ by solving (15), but
if it is postulated that the function $\pi$ is strictly monotone increasing, it is
still possible to obtain some information about the numbers $(a_s t_i + b_s)$.
Since $\pi$ is a cumulative distribution it is monotone increasing; however, it
will only be strictly monotone increasing in case the distribution function
D is never zero. This assumption is made explicit in Assumption 2.

Assumption 2. For all x, $D(x) > 0$.

Now, if $\pi$ is strictly monotone increasing, then it follows that $\pi(x) > \pi(y)$
if and only if $x > y$. If (14) holds, then it will be the case that for all r, s
in S and i, j = 1, ..., k,

(21)  $p_{s,i} > p_{r,j}$ if and only if $a_s t_i + b_s > a_r t_j + b_r.$

Therefore from an ordering on the numbers $p_{s,i}$ one can obtain a system of
inequalities involving the constants $a_s$, $b_s$, and $t_i$. If it is further specified
(as is required for the conditions of the problem) that $a_s > 0$ for all s, then
this set of inequalities will not in general have a solution.
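
Finding a solution of the inequalities (21) is the hard part, but checking a proposed solution is simple. The sketch below is illustrative only: the name `satisfies_order` and the dictionary layout are assumptions, and the function merely tests whether given candidate values of a, b and t reproduce the observed ordering of the proportions.

```python
from itertools import product

def satisfies_order(p, a, b, t):
    """Check the order condition (21) for candidate constants.

    p -- dict: stimulus -> list of observed proportions p[s][i]
    a -- dict: stimulus -> candidate a_s (must be positive)
    b -- dict: stimulus -> candidate b_s
    t -- list of candidate interval boundaries t_i
    """
    if any(a_s <= 0 for a_s in a.values()):
        return False
    cells = [(s, i) for s in p for i in range(len(t))]
    for (s, i), (r, j) in product(cells, repeat=2):
        observed = p[s][i] > p[r][j]
        implied = a[s] * t[i] + b[s] > a[r] * t[j] + b[r]
        if observed != implied:
            return False
    return True
```

A data set fits the model exactly when at least one candidate passes this check; the function does not search for such a candidate.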

However, whether or not a set of data fits the model may still be determined.
The necessary and sufficient condition for fit is that there exist numbers
$a_s$, $t_i$ and $b_s$ (where $a_s > 0$) satisfying the system of inequalities (21). If
this set of inequalities has a solution, then the interval boundaries may be
taken to be the $t_i$ satisfying (21). To determine the scale values of the stimuli
it is first necessary to construct a distribution function which can represent
the data. This is done in the following way. A differentiable monotone
increasing function $\pi(x)$ is constructed by connecting the discrete set of points

$\pi(a_s t_i + b_s) = p_{s,i}$

with any smooth, strictly monotone increasing curve. If, as is usual, there is
only a finite number of stimuli, then such a curve can always be constructed.
Finally, the distribution function D is defined by

(22)  $D(x) = \dfrac{d}{dx} \pi(x).$

Then, if the mean of the distribution D is m, the values $v_s$ of the stimuli
are determined by (18), $v_s = (m - b_s)/a_s$. As far as the determination of
the $v_s$ is concerned, it can be seen that they depend solely on the previously
determined $a_s$ and $b_s$ and on the mean m, which can be regarded as an additional
arbitrary constant in the determination of the $v_s$.

The remaining point of discussion for this model is the determination
of the degree of uniqueness of the scale values. Finding the set of all possible
solutions to the inequalities (21) presents, in general, extreme difficulty.
One thing that can be simply determined is the class of what might be called
the universal transformations of the solutions of the system of inequalities.
A universal transformation is one which, applied to a solution of any set of
inequalities, yields another solution to the same set of inequalities. By noting
a close connection between the theory of the inequalities (21) and a
two-dimensional affine geometry with a distinguished set of horizontal and vertical
lines, it can be shown [1] that the class of universal transformations for this
model is a subset of the affine transformations. The universal transformations
of the interval boundaries $t_i$ are the linear ones, and of the $a_s$ are
multiplications by a positive constant. The $b_s$ also are determined up to a linear
transformation, and hence so are the scale values $v_s$ (although the additional
arbitrary constant m also enters into their determination).

There is also an interesting special case in which, even though there is
only a finite number of observations, the scale values of the $t_i$ are determined
up to a linear transformation. This might be called the special case of equal
intervals, in which differences in successive $t_i$ are all the same. If, for example,
there exist stimuli with such relations among corresponding p's as
$p_{x,i} = p_{y,i+1} = p_{z,i+2}$, $p_{x,i+1} = p_{y,i+2}$, $p_{y,i} = p_{z,i+1}$, etc., it is possible
to determine that successive intervals are equal [1].

The fact that scale values obtained in this model, at least under certain
circumstances, are unique up to a linear transformation has two interesting
consequences for the original successive intervals model based on the
normality hypothesis. (i) If in the errorless case the original model fits, then
no other successive intervals model which assumes a different form for the
distribution functions will fit. The reason for this is that the forms of the distribution
functions (or the cumulative distributions) are determined by the values of
$p_{s,i}$ lying above the point $t_i$. Hence, if the $t_i$ are determined up to linear
transformation, so are the curves $p_{s,i}$. (ii) Where the normality assumption
does not fit the data it is theoretically possible to use the present generalization
to obtain a scale. Then the deviation of the scale values from those obtained
under a normality requirement can be evaluated. This, at least in principle,
provides a second kind of goodness of fit besides the usual least squares
regression methods employed where the data do not exactly fit the Thurstone
model.

DECISION STRUCTURE AND TIME RELATIONS IN
SIMPLE CHOICE BEHAVIOR*

Lee S. Christie and R. Duncan Luce

Group  Networks  Laboratory, 
Research  Laboratory  of  Electronics, 
Massachusetts  Institute  of  Technology 

The structure of simple decisions is considered in terms of a model
which composes such decisions from hypothetical elementary decisions.
It is argued that reaction-time data can be treated by the use of the
Laplace transform so as to overcome difficulties which negated earlier
attempts to analyze choice reactions. The general model leads to complex
problems which are formulated but not solved. Two special cases
of the model are worked out, and the statistical problem of evaluating
the fit of the model is discussed. It is shown that treating decision
processing as time-discrete leaves the essential features of the analysis
unchanged. Two experimental proposals, to provide data which should
be considered in further work on the model, are made.

I. Introduction. In this paper we propose a model for the way human
beings organize the decisions required by simple choice situations
into a collection of component decisions. It is our thesis
that such an organization of decisions must be reflected in the
distribution of reaction times and, therefore, that it may be possible
to infer the organization from the reaction-time distribution.
Although our thinking derives from empirical studies, we must
describe this proposal as speculative, for the model is not firmly
based on such studies. However, the development of the model
has led us to suggest two experiments which we believe may help
to determine what merit it has. These experiments will also help
to decide whether it is desirable to pursue further work in an attempt
to modify the model to accord better with reality, for we have
little hope that the particular details of the present model have any
lasting value.

II. Reaction Times. Suppose that a subject receives a stimulus of
a fixed type at time 0 and responds at time t with a fixed type of
response. The time interval, t, between the stimulus and the response
is called the simple reaction time. If the subject is presented
with one of a set of stimuli and a choice of response contingent
on the stimulus is required, the corresponding time interval
is known as the disjunctive reaction time. In either case, it is
clear that to obtain stable and readily analyzable time distributions
it is necessary that the stimulus be simple enough so that
the mean reaction time is no more than a second or two. Otherwise
unwanted stimuli may intervene between the test stimulus and the
response, and the interaction among the stimuli will cause a distortion
of the time distribution which will be very difficult to
analyze.

The study of reaction times, including disjunctive reaction times,
has a long history in the literature of psychology (cf. Woodworth,
1938, chap. xiv). In recent years, however, relatively little interest
has been evident in reaction-time studies. We may attribute
this loss of interest to two related causes. First, there has been
a failure to separate the time to make a decision (decision latency*)
from the other time lags involved in the total process. One attempt
to make this separation involved measuring the subject's response
to a stimulus when no decision was to be made and subtracting
this time from the time required to respond to the same stimulus
with the same motor action when a decision was involved. This
technique has been considered unsatisfactory for the following
reason. If the subject has no decision to make he is able to bring
his motor readiness for the specified response to a much higher
pitch than he can when he is required to make a disjunctive reaction;
thus, the base time (the time to react in a choice situation
excluding the time for the decision itself) cannot be equated to
any simple reaction time. We may conclude that the base time will
be determined, if at all, only from measurements taken when the
subject is required to make a decision.

* We use reaction time when referring to the time of a process timed
from stimulus presentation to motor response; latency when referring to
times of distinguished parts of such a process.

Second,  suppose  that  in  one  way  or  another  the  pure  decision 
latency  distribution  has  been  obtained — then  what?  It  is  true  that 
if  these  distributions  were  found  to  be  extremely  simple,  in  that 
they  could  be  well  approximated  by  some  class  of  elementary 
mathematical  functions,  the  separation  of  non-choice  latencies 
(base  times)  from  decision  latencies  might  be  an  end  in  itself.  If, 
however,  the  resulting  decision  latency  distribution  were  of  a 
complex  character,  the  challenge  to  account  for  it  in  more  primitive 
terms  would  remain. 

We describe these as related difficulties, for it is not unreasonable
to suppose that the method used to tease out the non-choice
latencies (base times) can also be used, or adapted, to decompose
the decision latencies into more primitive terms. Such a decomposition
of the observed reaction-time distribution may be an entirely
formal mathematical process with no empirical correlate or it
may be based on a model which purports to describe the way a
human being composes the finally observed decision from certain
more elementary ones. It is with such a model that we are concerned.

At  the  heart  of  our  proposal  is  the  idea  that  the  mathematical 
technique  of  the  Laplace  transform  may  be  employed  usefully  in 
the  study  of  reaction  times.  Since  it  is  unlikely  that  every  one  of 
our  readers  will  be  familiar  with  the  Laplace  transform,  we  have 
devoted  the  next  section  to  its  definition  and  to  a  list  of  those  of 
its  elementary  properties  which  we  shall  need. 

III. The Laplace Transform. Let F be a real-valued function of a
real variable t such that $F(t) = 0$ for $t < 0$. The real-valued function
L(F) of the real variable s defined by the equation

$L(F) = \int_0^\infty e^{-st} F(t)\,dt$   (1)

is called the Laplace transform of F. There is essentially no loss
of information about F in making this transformation [see equation
(4)], but because of some of the special properties of the transform
there is sometimes a distinct advantage to working with transformed
functions. We shall list a few of the elementary properties
of the transform which we shall need later; no proofs will be given
for they are well known (cf. Churchill, 1944).


i.  $L\left[\int_0^t F_1(\tau) F_2(t - \tau)\,d\tau\right] = L(F_1) L(F_2).$   (2)

ii.  $L\left(\dfrac{dF}{dt}\right) = sL(F) - F(0).$   (3)

iii.  If $L(F) = L(G)$, then $F = G + N$, where N is some function with the
property $\int_0^\tau N(t)\,dt = 0$ for all $\tau > 0$.   (4)
If it is known that F and G are continuous, then N is continuous and so
N = 0, i.e., F = G.

iv.  If a and b are constants,

$L(aF + bG) = aL(F) + bL(G).$   (5)

v.  If $F(t) = \lambda e^{-\lambda t}$, where $\lambda$ is a constant, then

$L(F) = \dfrac{\lambda}{s + \lambda}.$   (6)
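
Two of these properties, the convolution rule and the transform of the exponential, can be checked numerically. The following is a rough sketch under stated assumptions, not a method from the paper: the helper `laplace` is a hypothetical name, it uses a plain trapezoid rule on a truncated interval, and the rate 2.0 is an arbitrary illustrative value.

```python
import math

def laplace(f, s, upper=40.0, steps=40000):
    """Trapezoid approximation of the integral of exp(-s*t) * f(t) over [0, upper].

    The infinite upper limit is truncated at `upper`, which is adequate
    only when the integrand has decayed to a negligible size by then.
    """
    h = upper / steps
    total = 0.5 * (f(0.0) + math.exp(-s * upper) * f(upper))
    for k in range(1, steps):
        t = k * h
        total += math.exp(-s * t) * f(t)
    return total * h

lam = 2.0  # illustrative rate constant

def expo(t):
    """The exponential function lam * exp(-lam * t) of property v."""
    return lam * math.exp(-lam * t)

def conv(t):
    """Closed-form convolution of expo with itself: lam^2 * t * exp(-lam * t)."""
    return lam * lam * t * math.exp(-lam * t)
```

With these definitions, `laplace(expo, s)` should be close to `lam / (s + lam)`, and `laplace(conv, s)` should be close to the square of that value, as the convolution rule requires.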

IV.  The  Model.  Our  proposal  is  based  on  assumptions  which  are 
intuitively  acceptable,  but  which  at  the  moment  do  not  appear  to  be 
susceptible  of  direct  verification.  It  is  our  impression  that  any 
empirical  verification  of  the  model  must  deal  with  the  full  set  of 
assumptions  rather  than  with  each  in  isolation. 

Assumption I. It is possible, for a given experimental situation,
to divide the observed reaction time t into two latency components
$t_b$ and $t_c$, called base time and choice time respectively, such that:

1. $t = t_b + t_c$.

2. The value of $t_b$ depends only on the mode of stimulus presentation
and on the motor actions required of the subject. Specifically,
it is not directly dependent on the character of the choice
demanded.

3. The value of $t_c$ depends only on the choice demanded. Specifically,
it is not directly dependent on the mode of stimulus presentation
or the motor actions required.

Let the distributions of t, $t_b$, and $t_c$ be denoted by f, $f_b$, and $f_c$
respectively. Since conditions 2 and 3 imply that the two component
latencies are independent for a fixed experimental situation,
it follows from condition 1 that

$f(t) = \int_0^t f_b(\tau) f_c(t - \tau)\,d\tau.$   (7)
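
Because the two component latencies are independent and add, the observed distribution is the convolution of the base-time and choice-time distributions. A numerical version of that convolution can be sketched as follows. The names `convolve_at` and `exponential` are hypothetical, and the exponential shapes used for both components are purely illustrative assumptions; the model itself does not specify the form of the base-time distribution.

```python
import math

def exponential(rate):
    """Return an exponential density with the given rate (zero for t < 0)."""
    return lambda t: rate * math.exp(-rate * t) if t >= 0 else 0.0

def convolve_at(f_b, f_c, t, steps=2000):
    """Trapezoid approximation of the integral of f_b(tau) * f_c(t - tau) over [0, t]."""
    if t <= 0:
        return 0.0
    h = t / steps
    total = 0.5 * (f_b(0.0) * f_c(t) + f_b(t) * f_c(0.0))
    for k in range(1, steps):
        tau = k * h
        total += f_b(tau) * f_c(t - tau)
    return total * h

def two_exponential_density(mu, lam, t):
    """Closed-form convolution of exponential(mu) and exponential(lam), mu != lam."""
    return lam * mu / (mu - lam) * (math.exp(-lam * t) - math.exp(-mu * t))
```

For two exponential components with unequal rates the numerical value from `convolve_at` can be compared against `two_exponential_density`, which gives a quick check on the step size.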


Our second major assumption concerns only the choice latencies
and requires the distribution $f_c$ to be composed from more elementary
distributions. The basic idea is that the final decision made
by a person is organized into a set of simpler decisions which are,
in some appropriate sense, elementary decisions built into him. If
such a structure exists in human decision making, it is analogous
to the structure of a decision process in a computing machine,
which may be thought of as composed from a set of decisions
which are elementary relative to that machine, i.e., the elementary
decision capabilities built into the machine by the engineer. The
actual organization of these elementary decisions to form a more
complex one is a function both of the individual man or machine
and of the nature of the decision being made. This is true at least
of the machine, and we shall suppose it is true of human beings.
In addition, the breakdown of a complex decision is not, in general,
restricted to a serial process where one elementary decision
is followed by another, for in a machine different portions may be
simultaneously employed on different parts of the problem. There
seems every reason to suppose this is also true in a human being.

We shall describe the organization of decisions by a directed
graph. (The terms oriented graph and network have also been employed
in the mathematical literature and the term flow diagram is
used in connection with computer coding.) A directed graph consists
of a finite set of points, which are called nodes, with directed
lines between some pairs of them. Several examples are shown in
Figure 1. It is possible, in general, for more than one directed
line to connect two points, both in the sense that we may have two
or more in the same direction as in Figure 2a, and in the sense
that there may be lines with opposite directions as in Figure 2b.
In this paper, when we use the term directed graph, we shall suppose
that neither of these possibilities is allowed, that is, we
shall suppose that between any pair of nodes there is at most one
directed line.

Figure 1

We shall employ a directed graph to represent the organization of
decisions in the following way: At each node we shall assume that
an "elementary decision" will take place, the latency distribution
governing the decision at node i being denoted by $f_i$. The decision
process is initiated at node i when, and only when, decisions have
been made at each of those nodes j such that there is a directed
line from j to i. We may think of the "demon" at node i waiting to
begin making his decision until he has received the decisions of
all the demons who precede him in the directed graph.

Figure 2

For the directed graphs we shall consider, there will be at least
one node, possibly more, which is the terminal point of no line;
these will be the decision points which are activated by the experimental
stimulus at time 0. There will also be at least one
node, and again possibly more, which initiates no directed line,
and it is only when the decisions at all these nodes have been
taken that the motor actions, which signal the subject's response
to the experimenter, are begun. It is clear that for any individual
and for any stimulus situation it is possible to find at least one
directed graph N and elementary latencies $f_i$ which compose as
described above to give $f_c$. For example, let N have but one node
and let $f_1 = f_c$. We shall, however, make stringent assumptions
about N and $f_i$ which, in general, exclude this trivial solution. It
is some of these assumptions which most likely will be abandoned
or modified if the present model cannot cope with experimental
data.
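
The composition rule just described (a node begins only when every node with a line into it has finished, and the motor response begins only when every node that initiates no line has finished) can be sketched in code. This is an illustration under assumptions: the function name `choice_time`, the edge-list representation, and the use of fixed latency values in place of random draws are all hypothetical, and the graph is assumed to contain no directed cycles.

```python
def choice_time(edges, latency):
    """Time at which all final decisions are complete, measured from time 0.

    edges   -- list of (j, i) pairs, each a directed line from node j to node i
    latency -- dict mapping every node to the duration of its elementary decision
    """
    preds = {node: [] for node in latency}
    for j, i in edges:
        preds[i].append(j)

    finish = {}

    def finish_time(i):
        # A node starts when its slowest predecessor finishes (time 0 if none).
        if i not in finish:
            start = max((finish_time(j) for j in preds[i]), default=0.0)
            finish[i] = start + latency[i]
        return finish[i]

    # Nodes that initiate no directed line.
    has_successor = {j for j, _ in edges}
    final_nodes = [node for node in latency if node not in has_successor]
    return max(finish_time(node) for node in final_nodes)
```

For a chain of nodes the result is the sum of the latencies, and for a set of unconnected nodes it is the largest single latency. The value returned covers the choice time only; the base time is not included.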

Assumption II. It is possible to find for each stimulus situation,
$\sigma$, a set of stimulus situations, S, which all have the same
base-time distribution, $f_b$, and an elementary decision latency, $f_e$, such
that:

1. $\sigma$ is an element of S.

2. For each choice situation $\rho$ in S there exists a directed graph
$N_\rho$ with the properties,

a. each of the latency distributions at the nodes is the same,
namely, $f_e$,

b. the decision time at node i is independent of that at node j,

c. $f_c$ is a composition of $N_\rho$ and $f_e$ (as described above).

3. Among the stimulus situations in S there is one whose directed
graph satisfying conditions II.2 is a single point.

In less formal terms, we require that there be groups of stimulus
situations all of which have the same base-time distribution and
which can be built up according to a directed graph from elementary
and independent decisions which all have the same latency
distribution $f_e$. In addition, among the stimulus situations in this
class we assume that there is one which employs but a single elementary
decision. The latter assumption can be weakened, if we
choose, to the assumption that there is one stimulus situation
whose directed graph we know a priori, but in what follows we
shall take the stronger form that the graph is a single point.

V. Comments. The above assumptions comprise the formal structure
of our model; there are a series of auxiliary comments which
are necessary.

Even if we were able to show that these assumptions can be met
for certain wide classes of experimental data, but that in so doing
we obtain elementary decision distributions $f_e$ which are extremely
complicated, it is doubtful that we should accept the model as an
adequate description of the decision process. Equally well, if the
directed graphs required are excessively complex we should reject
the model. The hope is that it is possible to subdivide the total
process into a relatively small set of subprocesses which are
practically identical. But we do not want to be forced to an analysis
in terms of individual neurone firings. It is probable that Assumption
II.3 effectively prevents this extremity by requiring the
existence of a stimulus situation which demands but one elementary
decision for its response.

It is also implicit in our thinking, although not a part of the
formal model, that the sets S of "similar" stimulus situations will
include as subsets those experimental situations we naturally
think of as being similar. For example, suppose the subject is
presented with n points, one of which is colored differently from
the others and he is required to signal the location of that one.
We should want to consider as "similar" the set of these situations
generated as n ranges over the smaller integers. We should
probably reject the model if they could not be put in the same set
S, even if by great ingenuity we were able to find other less intuitively
simple sets of situations for which the model held.

When  the  model  is  applied  to  experimental  data  we  anticipate 
that  the  case  of  the  directed  graph  being  a  single  point  will  be 
identified  with  the  intuitively  "simplest"  choice  situation  within 
the  set  of  "similar"  ones. 

In some of the following sections we shall make the following
explicit assumption as to the form of $f_e$:

$f_e(t) = \begin{cases} \lambda e^{-\lambda t}, & t \ge 0, \\ 0, & t < 0, \end{cases}$

where $\lambda$ is a positive constant. There are two grounds for supposing
this might be an appropriate assumption. First, let us suppose
that when no decision has been reached by time t following
stimulation at time 0 then the probability that the decision will be
reached between t and $t + \Delta t$, where $\Delta t$ is small, is approximately
proportional to $\Delta t$, with a constant of proportionality $\lambda$. In this
case, it is not difficult to show that the distribution of decisions
is exponential (Christie, 1952a, b). Whether this assumption is
correct is an empirical problem, but it must be admitted that it has
the virtue of simplicity. Second, and probably more relevant, it is
a relatively common observation that as certain decision situations
are made more and more simple, the observed latency is better and
better approximated by an exponential distribution slightly displaced
from the origin (Christie, 1952b; Luce, 1953). The main
error is generally on the rising limb. If this change toward simplicity
is actually toward a directed graph consisting of one point,
and if our other assumptions hold, then it seems plausible that the
elementary decision latency is actually exponential but that the
observed distribution is smeared by the convolution of the
base-time distribution and the decision-time distribution.
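
The first ground, a constant probability of deciding in each short interval, can be illustrated numerically. In the sketch below, which is an illustration and not the cited derivation, time is cut into steps of length dt and the decision is assumed to occur in each step with probability lam * dt, independently of the past. The function names and parameter values are hypothetical.

```python
import math

def survival_discrete(lam, t, dt):
    """Probability that no decision has occurred by time t.

    Each step of length dt ends the wait with probability lam * dt,
    so surviving t / dt steps has probability (1 - lam * dt) ** (t / dt).
    """
    steps = int(round(t / dt))
    return (1.0 - lam * dt) ** steps

def survival_exponential(lam, t):
    """Probability that an exponential latency with rate lam exceeds t."""
    return math.exp(-lam * t)
```

As dt is made smaller, `survival_discrete` approaches `survival_exponential`, which is the sense in which the constant-proportionality supposition leads to an exponential latency.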


VI. The Problem. Let S be a set of choice situations which are
presumed to satisfy the assumptions of the model, i.e., S is a set
of the type described in Assumption II. Let $f_\sigma$ denote the
reaction-time distribution associated with a typical member of S. The
problem is then to find distributions $f_b$ and $f_e$ and a set of directed
graphs $N_\sigma$, where $\sigma$ ranges over S, such that each of the triples
$(f_b, f_e, N_\sigma)$ when composed according to the assumptions of Section
IV yields the distribution $f_\sigma$. There may, of course, be no, one, or
many solutions to the problem, but one hopes that by an appropriate
choice of S there will be exactly one solution.

It would appear that if the problem is to be solved in any degree
of generality, it must be attacked somewhat indirectly. It may
prove appropriate to solve first the following problem: Given a
continuous distribution f, find the set of all triples $(f_b, f_e, N)$,
where $f_b$ and $f_e$ are continuous, which satisfy the assumptions and
which compose to form f. It seems very plausible to suppose that,
in general, there are many solutions to this problem. However, if
f and f' are two distributions associated with choice situations
from the same set S, then it will be necessary to accept only those
triples with the same $f_b$ and $f_e$ present in both cases. Further
stimulus situations should serve further to restrict the possibilities.

These problems will not be attacked, let alone solved, in this
paper; they appear to be of considerable difficulty. We know of
only one important lead in this direction, but we have not investigated
it. In recent years, electrical engineers have been concerned
with the problem of synthesizing in a systematic manner electrical
networks to have preassigned transfer functions. If we identify the
given reaction-time distribution with the transfer functions, the
graph N with the electrical network, and $f_e$ with component characteristic,
there is an analogy between the two problems. This is
probably worth investigation, but it is almost certain that solving
our problem will prove to be a major research undertaking.

To some extent the problem we pose may be simplified by using
some of our assumptions and the Laplace transform. Let $f_\sigma$ be the
observed distribution of reaction times for a given stimulus situation
$\sigma$; then by Assumption II we know there exists a set S which
includes $\sigma$ and another stimulus situation whose directed graph
consists of one point. Let $f_1$ denote the distribution of reaction
times in the latter case. From Assumption I we may write

$f_\sigma(t) = \int_0^t f_b(\tau) f_c(t - \tau)\,d\tau,$
$f_1(t) = \int_0^t f_b(\tau) f_e(t - \tau)\,d\tau.$   (8)

Taking the Laplace transform in each case and applying equation
(2),

$L(f_\sigma) = L(f_b) L(f_c),$
$L(f_1) = L(f_b) L(f_e).$   (9)

If we divide the first equation by the second in equation (9), we
obtain

$\dfrac{L(f_\sigma)}{L(f_1)} = \dfrac{L(f_c)}{L(f_e)}.$   (10)

This is a fairly crucial consequence of our assumptions, for it
is seen that all mention of the base time has been eliminated. It
is an equation relating the empirical data to $f_e$ and $N_\sigma$.

At this point we should raise an important practical problem.
Empirically, one does not obtain estimates of the distribution f,
but rather approximations to the cumulative distribution

$F(t) = \int_0^t f(\tau)\,d\tau.$

(Throughout we shall use small Latin letters to denote distributions
and the corresponding capitals to denote their cumulatives.)
Now, while approximations to F may be reasonably accurate, it is
well known that numerical differentiation of data tends to magnify
errors and is, therefore, to be avoided. So the question arises
whether we can translate our results, in particular equation (10),
into statements about the cumulative distributions. From equation
(3) we have

$L(f) = sL(F) - F(0).$

Since we are speaking of empirical data we may assume F(0) = 0,
and so equation (10) becomes

$\dfrac{L(F_\sigma)}{L(F_1)} = \dfrac{L(f_c)}{L(f_e)}.$   (11)

Having eliminated $f_b$ from our discussion, the problem of determining
it remains. Since our division in equation (11) assumes
$f_b$ is the same in the several cases, it will suffice to determine it
from any one. The simplest, of course, is the case where the graph
consists of one point, in which case

$L(f_b) = \dfrac{L(f_1)}{L(f_e)} = \dfrac{sL(F_1)}{L(f_e)}.$   (12)

As an example of how equation (12) may be used, suppose $f_e$ is
exponential with time constant $\lambda$. Then by equation (6),

$L(f_e) = \dfrac{\lambda}{s + \lambda} = \dfrac{1}{s/\lambda + 1},$

and so equation (12) becomes

$L(f_b) = \left(\dfrac{s}{\lambda} + 1\right) L(f_1).$

If we make the reasonable assumption that $f_1(0) = 0$, then from
equations (3) and (5) we find

$L(f_b) = L\left(f_1 + \dfrac{1}{\lambda}\dfrac{df_1}{dt}\right).$

Assuming that $f_b$ is continuous and that $f_1$ has a continuous derivative,
equation (4) implies

$f_b = f_1 + \dfrac{1}{\lambda}\dfrac{df_1}{dt},$

or integrating from 0 to t,

$F_b(t) = F_1(t) + \dfrac{1}{\lambda} f_1(t).$   (13)

Since $f_1$ must be determined from empirical data, it is clear from
equation (13) that considerable data will be necessary to obtain
accurate estimates of $F_b$.

VII. Serial Decision Process. An alternative program to solving
the general problem discussed in Section VI is to discover the
consequences of certain explicit assumptions about the directed
graph $N$ and the elementary latency $f$. The results of this alternative
program will, unfortunately, be much weaker than a solution of
the general problem, but they may have considerable heuristic
value. We may choose such extra assumptions on intuitive grounds,
with the hope that they may be relevant for some experimental
data. We shall examine two cases which are, in a sense, the two
most extreme forms of the directed graph $N$. The first, the topic of
this section, is the general serial case shown in Figure 3a, and
the second, which will be discussed in Section VIII, is the parallel
case shown in Figure 3b.

FIGURE 3. (a) Serial case; (b) parallel case.
It follows immediately from Assumptions I and II.2.b that the
observed distribution $f_n$ of a serial process having $n$ nodes is given
by

$$f_n(t) = \int_0^t \cdots \int_0^{t_3}\!\int_0^{t_2} f_b(t_1)\, f(t_2 - t_1) \cdots f(t - t_n)\; dt_1\, dt_2 \cdots dt_n . \tag{14}$$

Applying the Laplace transform to equation (14) and using equation
(2) we have

$$L(f_n) = L(f_b)\,[L(f)]^{n} , \tag{15}$$

or dividing by the case $n = 1$,

$$[L(f)]^{n-1} = \frac{L(F_n)}{L(F_1)} . \tag{16}$$

Equation (16) is the explicit form of equation (11) for the serial
case. Clearly, if we are given numerical data we may determine
(possibly numerically) $f$ for each value of $n$.

As an example of how this might be done when we know the
general form of $f$, suppose $f$ is exponential with the time constant
$\lambda$. In that case, equation (16) becomes

$$\frac{L(F_n)}{L(F_1)} = \frac{1}{\left(\frac{s}{\lambda} + 1\right)^{n-1}} . \tag{17}$$

In Figure 4 we have presented plots of $1/(s/\lambda + 1)^{n-1}$ vs. $s/\lambda$ for small
values of $n$.
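Equation (17) can be checked numerically. The following Python sketch is illustrative only. It assumes an exponential base-time distribution with mean 0.2 sec and a value of $\lambda$ of 10 per sec, neither of which is taken from the text, and the function names are hypothetical. It simulates serial latencies and compares the ratio of the estimated transforms with $1/(s/\lambda + 1)^{n-1}$. Because $L(f) = sL(F)$ when $F(0) = 0$, the ratio of the transforms of the distributions equals the ratio of the transforms of their cumulatives.

```python
import math
import random


def serial_ratio(s, lam, n):
    """Right side of equation (17): 1 / (s/lam + 1)**(n - 1)."""
    return 1.0 / (s / lam + 1.0) ** (n - 1)


def simulate_serial(n, lam, base_mean, trials, rng):
    """Draw latencies: one base time plus n exponential elementary decisions.

    The exponential base time is an assumption made for this sketch only.
    """
    return [
        rng.expovariate(1.0 / base_mean)
        + sum(rng.expovariate(lam) for _ in range(n))
        for _ in range(trials)
    ]


def laplace_of_distribution(samples, s):
    """Monte Carlo estimate of L(f)(s), the expected value of exp(-s T)."""
    return sum(math.exp(-s * t) for t in samples) / len(samples)


rng = random.Random(0)
lam, s = 10.0, 5.0  # assumed values, chosen so that s/lam = 1/2
one_node = simulate_serial(1, lam, 0.2, 20000, rng)
three_nodes = simulate_serial(3, lam, 0.2, 20000, rng)

empirical = laplace_of_distribution(three_nodes, s) / laplace_of_distribution(one_node, s)
theory = serial_ratio(s, lam, 3)
print(f"empirical ratio {empirical:.3f}, equation (17) gives {theory:.3f}")
```

The base-time transform cancels in the ratio, so the comparison should not depend on the form assumed for the base time.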

A second equation may be obtained by observing that the mean,
$\mu_1(n)$, of a serial process with $n$ exponential elementary decisions
is given by

$$\mu_1(n) = \mu_1(b) + \frac{n}{\lambda} , \tag{18}$$

where $\mu_1(b)$ is the mean base time. Thus,

$$\mu_1(n) - \mu_1(1) = \frac{n - 1}{\lambda} . \tag{19}$$

We may now use equations (17) and (19) to attempt to decide
whether a given set of data is adequately fit by the assumptions of
the model, plus the added assumptions of a serial directed graph
and exponential elementary latencies. There are serious statistical
questions as to how this may best be done, but the following
rough-and-ready method may suffice until the statistical problems are
formulated and solved. From the data we compute $L(F_n)/L(F_1)$ as a function of
$s$; this we may assume is in the form of a plot, which we shall call
plot A. For each (reasonable) value of $n$ and for some value of $s/\lambda$,
say $s/\lambda = 1/2$, find in Figure 4 the corresponding value of
$1/(s/\lambda + 1)^{n-1}$. We know from equation (17) that this must be equal to
$L(F_n)/L(F_1)$ if our
assumptions are correct and if the correct value of $n$ has been
chosen. We thus enter plot A at this point and determine the value
of $s$. Since we selected $\lambda = 2s$, this determines $\lambda$. But equation
(19) presents a relation between the observed means, $\lambda$, and $n$
which will be satisfied if our assumptions are valid. We choose
the value of $n$ such that the error between the observed means [the
left side of equation (19)] and $(n - 1)/\lambda$ is a minimum; this yields
the best possible fit at the point $s/\lambda = 1/2$ for the model with the
added assumptions of a serial graph and exponential $f$. Using
these values of $\lambda$ and $n$, one may add the theoretical curve
$1/(s/\lambda + 1)^{n-1}$ vs. $s$ to plot A, and a comparison between the two
curves will give some indication of the adequacy of the assumptions.
Clearly, a less subjective criterion of the quality of this
fit is needed.

FIGURE 4
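The fitting procedure just described can be sketched in Python. This is a minimal illustration under stated assumptions, not a statistical method: plot A is represented by a function `plot_a(s)` that decreases in $s$, the "data" are generated from the model itself with $n = 3$ and $\lambda = 10$, and the names `invert_decreasing` and `fit_serial` are hypothetical.

```python
def invert_decreasing(fn, target, lo=0.0, hi=1.0e6, iterations=200):
    """Bisection: find s in [lo, hi] with fn(s) == target, fn decreasing in s."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if fn(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fit_serial(plot_a, mean_difference, candidates, entry=0.5):
    """Pick n and lambda by entering plot A at s/lambda = entry.

    plot_a          -- empirical L(F_n)/L(F_1) as a function of s
    mean_difference -- observed left side of equation (19)
    candidates      -- values of n to try (each at least 2)
    """
    best = None
    for n in candidates:
        target = 1.0 / (entry + 1.0) ** (n - 1)   # value read from Figure 4
        s = invert_decreasing(plot_a, target)      # enter plot A at that value
        lam = s / entry                            # entry = s / lambda
        error = abs(mean_difference - (n - 1) / lam)
        if best is None or error < best[2]:
            best = (n, lam, error)
    return best


# Synthetic plot A generated from the model with n = 3, lambda = 10 (assumed).
def synthetic_plot_a(s):
    return 1.0 / (s / 10.0 + 1.0) ** 2


n_hat, lam_hat, error = fit_serial(synthetic_plot_a, 0.2, range(2, 8))
print(n_hat, round(lam_hat, 6), error)
```

With real data, plot A would be a tabulated curve rather than a formula, and the inversion step would interpolate in that table.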

VIII. Parallel Decision Process. If we suppose that the $n$ elementary
decision processes are carried out in parallel (see Figure 3b),
the choice latency distribution is the distribution of the largest of
$n$ selections, one from each of the elementary distributions. This
is known to be given by

$$\frac{d}{dt} \prod_{i=1}^{n} F_i(t) ,$$

which in the case when all the elementary distributions are the
same, namely $F$, reduces to

$$n f(t)\,[F(t)]^{n-1} .$$

If we denote the observed reaction-time distribution for the parallel
case by $g_n$, then it follows from equation (7) that

$$g_n(t) = \int_0^t f_b(\tau)\, n f(t - \tau)\,[F(t - \tau)]^{n-1}\, d\tau . \tag{20}$$

Applying the Laplace transform and equation (2),

$$L(g_n) = L(f_b)\, L(n f F^{n-1}) . \tag{21}$$

As before, we may divide by $L(g_1)$ to eliminate $L(f_b)$.
To proceed further, we assume $f$ is exponential; then

$$\begin{aligned}
L(n f F^{n-1}) &= n\lambda \int_0^\infty e^{-st} e^{-\lambda t} (1 - e^{-\lambda t})^{n-1}\, dt \\
&= n\lambda \int_0^\infty e^{-(s+\lambda)t} \sum_{k=0}^{n-1} \binom{n-1}{k} (-1)^k e^{-k\lambda t}\, dt \\
&= n \sum_{k=0}^{n-1} \binom{n-1}{k} \frac{(-1)^k}{\frac{s}{\lambda} + k + 1} .
\end{aligned}$$

To evaluate the above sum, consider the function

$$\phi(x) = \sum_{k=0}^{n-1} \binom{n-1}{k} (-1)^k x^{\frac{s}{\lambda} + k} = x^{s/\lambda} (1 - x)^{n-1} .$$

Observe that

$$n \int_0^1 \phi(x)\, dx = n \sum_{k=0}^{n-1} \binom{n-1}{k} \frac{(-1)^k}{\frac{s}{\lambda} + k + 1} = L(n f F^{n-1}) ,$$

and that

$$n \int_0^1 \phi(x)\, dx = n \int_0^1 x^{s/\lambda} (1 - x)^{n-1}\, dx = n B\!\left(\tfrac{s}{\lambda} + 1,\, n\right) = \frac{n\,\Gamma\!\left(\frac{s}{\lambda} + 1\right) \Gamma(n)}{\Gamma\!\left(\frac{s}{\lambda} + n + 1\right)} ,$$

where $B(m, n)$ is the Beta function and $\Gamma(n)$ is the Gamma function.
From these results we easily obtain

$$\frac{L(G_n)}{L(G_1)} = \frac{n B\!\left(\frac{s}{\lambda} + 1,\, n\right)}{B\!\left(\frac{s}{\lambda} + 1,\, 1\right)} = \frac{n!\,\Gamma\!\left(\frac{s}{\lambda} + 2\right)}{\Gamma\!\left(\frac{s}{\lambda} + n + 1\right)} . \tag{22}$$

In Figure 4 we have also presented plots of $n!\,\Gamma(s/\lambda + 2)/\Gamma(s/\lambda + n + 1)$ vs. $s/\lambda$
for small values of $n$.

The mean of the parallel process can be shown to be given by
$\mu_1(n) = \mu_1(b) + \frac{1}{\lambda} \sum_{i=1}^{n} \frac{1}{i}$, and thus we have, as in the serial case, a
second relation which must be met:

$$\mu_1(n) - \mu_1(1) = \frac{1}{\lambda} \sum_{i=2}^{n} \frac{1}{i} . \tag{23}$$

The procedure for curve fitting is the same as described for the
serial case except that $s/\lambda = 1$ seems to be a more favorable place
to enter the graph than is $s/\lambda = 1/2$.
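As a check on equation (22), the Gamma-function form can be compared with the binomial sum from which it was derived. The Python sketch below is illustrative; the function names are hypothetical and `x` stands for $s/\lambda$.

```python
import math


def parallel_ratio(x, n):
    """Equation (22) with x = s/lambda: n! * Gamma(x + 2) / Gamma(x + n + 1)."""
    return math.factorial(n) * math.gamma(x + 2.0) / math.gamma(x + n + 1.0)


def parallel_ratio_by_sum(x, n):
    """Same quantity from the binomial sum, divided by the n = 1 case 1/(x + 1)."""
    total = sum(
        math.comb(n - 1, k) * (-1) ** k / (x + k + 1.0) for k in range(n)
    )
    return n * total * (x + 1.0)


def parallel_mean_increment(lam, n):
    """Right side of equation (23): (1/lambda) * sum of 1/i for i = 2..n."""
    return sum(1.0 / i for i in range(2, n + 1)) / lam


for n in (1, 2, 3, 4):
    print(n, parallel_ratio(1.0, n), parallel_ratio_by_sum(1.0, n))
```

The two columns should agree to rounding error, which is a useful guard against sign or index slips when tabulating the curves of Figure 4.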


IX. Model Selection. Without a solution to the general problem
described in Section VI, there arise statistical problems as to how
well a particular set of assumptions, such as serial directed graph
and exponential $f$, fit the data and whether another set of similar
assumptions is better or not. In addition, within any one set of
assumptions there are undetermined constants, such as $\lambda$ and $n$,
and there is a question as to how best to choose them. We have indicated
one procedure (end of Section VII) to determine the constants,
but it is almost certain that such an ad hoc procedure is
not optimal.

The  difficulty  of  making  a  selection  among  different  sets  of  as- 
sumptions is  evidently  quite  serious  for  it  can  be  seen  from  Figure 
4  that  for  almost  any  small  value  of  n  in  one  there  is  an  n  in  the 
other  such  that  the  two  curves  are  fairly  similar.  Presumably,  any 
other  directed  graph  will  produce  curves  which,  in  some  sense,  lie 
between  these  two  extreme  cases.  Thus,  the  shape  of  the  empirical 
data  curves  will  not  be  extremely  revealing  of  the  proper  directed 
graph  to  use — an  unfortunate  situation. 
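The similarity of the two families of curves can be illustrated numerically. The pairing used below, a serial graph with $n = 2$ against a parallel graph with $n = 3$, is chosen for this sketch only and is not taken from the text. The formulas are those of equations (17) and (22) with `x` standing for $s/\lambda$.

```python
import math


def serial_curve(x, n):
    """Equation (17): 1 / (x + 1)**(n - 1), with x = s/lambda."""
    return 1.0 / (x + 1.0) ** (n - 1)


def parallel_curve(x, n):
    """Equation (22): n! * Gamma(x + 2) / Gamma(x + n + 1), with x = s/lambda."""
    return math.factorial(n) * math.gamma(x + 2.0) / math.gamma(x + n + 1.0)


# Compare the two curves on a grid of s/lambda from 0 to 3 (assumed range).
grid = [i / 100.0 for i in range(0, 301)]
largest_gap = max(abs(serial_curve(x, 2) - parallel_curve(x, 3)) for x in grid)
print(f"largest vertical gap on the grid: {largest_gap:.3f}")
```

Both curves start at 1 and coincide at $s/\lambda = 1$, so over this range they differ by only a few hundredths, which is small relative to the sampling error one would expect in an empirical plot.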

It is clear that there are a number of difficult statistical problems
here, but in all likelihood it will prove to be more efficient
first to do some experimental exploring using subjective judgments
as to goodness-of-fit before trying to formulate and to solve the
statistical problems.

X.  The  Perceptual  Moment.  In  Section  II  we  remarked  that  in 
reaction-time  studies  the  mean  reaction  time  should  be  of  the  order 
of one second if unwanted interactions with other stimuli are to be
avoided.  This  means  that  the  data  will  be  in  a  range  where  certain 
peculiar  phenomena  have  been  observed.  To  explain  these  ob- 
servations, it  has  been  proposed  that  a  subject  processes  informa- 
tion very  rapidly  at  certain  discrete  times  and  that  he  is  in  a  re- 
fractory period  between  them.  The  period  from  the  beginning  of 
one  such  hypothetical  event  to  the  beginning  of  the  next  has  been 
termed  the  perceptual  moment  (Stroud,  1949a,  b).  Unfortunately, 
relatively  little  direct  experimentation  has  been  conducted  on  this 
problem,  and  so  it  is  not  possible  at  this  time  to  give  a  formal 
statement  of  the  properties  of  the  moment.  Indeed,  there  are  in- 
vestigators who  doubt  its  existence.  In  the  case  that  it  does  exist, 
our  analysis  will  be  applied  to  situations  where  it  most  probably 
will  have  an  effect.  It  is,  therefore,  of  interest  whether  the  analy- 
sis can  be  adapted  to  cope  with  it.  In  this  section  we  shall  make 
a  simple  hypothesis  as  to  the  nature  of  the  moment,  not  with  any 
belief  that  it  is  correct,  but  only  to  indicate  that  the  general  fea- 
tures of  the  analysis  remain  unchanged. 

Let us assume the moment is of fixed duration, say $\delta$ seconds,
and that while a person may receive information at any time during
that period it will only serve as a stimulus at the end of the period.
Furthermore, we will assume that all intermediate (elementary)
decisions occur at multiples of $\delta$. Since we may assume that
there is no correlation between the stimulus presentation and the
timing of the moment, we may assume the stimulus is presented according
to a uniform distribution $h$ in the interval 0 to $\delta$. This
assumption may be inappropriate, for it may happen that a person
is only able to assimilate information during part of the moment;
we shall return to this point later.

The question now arises as to the discrete form we should assume
for the elementary decision process. In the continuous case
we took it to be exponential, and so we shall use the discrete
analogue. We assume that if no decision has been reached by the
$i$th moment following the presentation, i.e., at time $i\delta$, then the
probability of a decision in the $i$th moment is $\lambda\delta$. If we call the
probability of a response by the $i$th moment $P_i$, then

$$P_i = P_{i-1} + [1 - P_{i-1}]\,\lambda\delta = (1 - \lambda\delta)P_{i-1} + \lambda\delta . \tag{24}$$

With the initial condition $P_0 = 0$, the difference equation (24) is
solved by

$$P_i = 1 - (1 - \lambda\delta)^i .$$

The probability of a decision in the $i$th moment is obviously
$[1 - P_{i-1}]\,\lambda\delta$; hence, we have

$$\lambda\delta(1 - \lambda\delta)^{i-1} , \tag{25}$$

as our distribution $f$.
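The solution of the difference equation (24) can be confirmed by direct iteration. The Python sketch below is illustrative; the values $\lambda = 10$ and $\delta = 0.05$ are assumptions made for the example, and the function names are hypothetical.

```python
def cumulative_by_recursion(lam, delta, moments):
    """Iterate equation (24) from P_0 = 0 and return [P_0, P_1, ..., P_moments]."""
    p = lam * delta
    values = [0.0]
    for _ in range(moments):
        values.append((1.0 - p) * values[-1] + p)
    return values


def cumulative_closed_form(lam, delta, i):
    """Stated solution of equation (24): 1 - (1 - lam*delta)**i."""
    return 1.0 - (1.0 - lam * delta) ** i


def decision_probability(lam, delta, i):
    """Equation (25): probability of a decision in the i-th moment."""
    return lam * delta * (1.0 - lam * delta) ** (i - 1)


lam, delta = 10.0, 0.05  # assumed values, so that lam * delta = 0.5
recursive = cumulative_by_recursion(lam, delta, 10)
for i in range(1, 11):
    step = recursive[i] - recursive[i - 1]
    print(i, recursive[i], cumulative_closed_form(lam, delta, i), step)
```

The successive differences of the cumulative values reproduce the probabilities of equation (25), as they must.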

If we replace this discrete distribution, equation (25), by a
continuous one $\phi_\epsilon$ which has rectangles of width $\epsilon$ and height
$\lambda\delta(1 - \lambda\delta)^{i-1}/\epsilon$ centered about the point $i\delta$, then it is clear that in
the limit as $\epsilon \to 0$ this becomes the discrete distribution.

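Let the base-time distribution be denoted by $f_b$ as before; then
the observed data in the discrete serial case are given by

$$f_n(t) = \lim_{\epsilon \to 0} \int_0^t \cdots \int_0^{t_3}\!\int_0^{t_2} f_b(t_1)\, h(t_2 - t_1)\, \phi_\epsilon(t_3 - t_2) \cdots \phi_\epsilon(t - t_{n+1})\; dt_1 \cdots dt_{n+1} . \tag{26}$$

Applying the Laplace transform and using equation (2),

$$L(f_n) = \lim_{\epsilon \to 0} L(f_b)\, L(h)\, [L(\phi_\epsilon)]^n = L(f_b)\, L(h) \left[\lim_{\epsilon \to 0} L(\phi_\epsilon)\right]^n . \tag{27}$$

Observe,

$$\begin{aligned}
L(\phi_\epsilon) &= \sum_{i=1}^{\infty} \frac{\lambda\delta(1 - \lambda\delta)^{i-1}}{\epsilon} \int_{i\delta - \epsilon/2}^{i\delta + \epsilon/2} e^{-st}\, dt \\
&= \frac{e^{s\epsilon/2} - e^{-s\epsilon/2}}{s\epsilon}\; \lambda\delta(1 - \lambda\delta)^{-1} \sum_{i=1}^{\infty} \left[(1 - \lambda\delta)\, e^{-s\delta}\right]^i .
\end{aligned}$$

But,

$$\lim_{\epsilon \to 0} \frac{e^{s\epsilon/2} - e^{-s\epsilon/2}}{s\epsilon} = 1 ,$$

so,

$$\lim_{\epsilon \to 0} L(\phi_\epsilon) = \frac{\lambda\delta\, e^{-s\delta}}{1 - (1 - \lambda\delta)\, e^{-s\delta}} .$$

Substituting in equation (27) and dividing by the case $n = 1$, we
have

$$\frac{L(F_n)}{L(F_1)} = \left[\frac{\lambda\delta\, e^{-s\delta}}{1 - (1 - \lambda\delta)\, e^{-s\delta}}\right]^{n-1} , \tag{28}$$

which is the crucial equation for the discrete serial case. The
mean of the discrete distribution $f$ is given by

$$\sum_{i=1}^{\infty} i\delta\, \lambda\delta(1 - \lambda\delta)^{i-1} = \frac{1}{\lambda} . \tag{29}$$

Thus, the relation between observed means is

$$\mu_1(n) - \mu_1(1) = \frac{n - 1}{\lambda} . \tag{30}$$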
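Two properties of the discrete serial case are easy to check numerically: the mean in equation (29), and the fact that each factor in equation (28) approaches the continuous exponential transform $\lambda/(s + \lambda)$ as the moment becomes short. The Python sketch below is illustrative; the numerical values and function names are assumptions made for the example, and the infinite sum is truncated.

```python
import math


def discrete_stage_transform(s, lam, delta):
    """One factor of equation (28): lam*delta*e^(-s*delta) / (1 - (1 - lam*delta)*e^(-s*delta))."""
    decay = math.exp(-s * delta)
    return lam * delta * decay / (1.0 - (1.0 - lam * delta) * decay)


def discrete_serial_ratio(s, lam, delta, n):
    """Right side of equation (28)."""
    return discrete_stage_transform(s, lam, delta) ** (n - 1)


def discrete_stage_mean(lam, delta, terms=10000):
    """Left side of equation (29), truncated after `terms` moments."""
    p = lam * delta
    return sum(i * delta * p * (1.0 - p) ** (i - 1) for i in range(1, terms + 1))


lam, s = 10.0, 5.0  # assumed values
print(discrete_stage_mean(lam, 0.05), 1.0 / lam)
for delta in (0.05, 0.005, 0.0005):
    print(delta, discrete_stage_transform(s, lam, delta), lam / (s + lam))
```

The truncation in `discrete_stage_mean` is harmless when $\lambda\delta$ is not too small; for very short moments more terms would be needed.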

Now, if we know the value of $\delta$, i.e., the length of the moment,
then these two sets of equations may be used in exactly the same
fashion as were equations (17) and (19) of Section VII. We have
no theoretical value of $\delta$, so it will be necessary to perform independent
measurements of it. It is clear that if the perceptual
moment is a real phenomenon it will be important to ascertain its
properties prior to analyzing experiments on reaction time.

One further comment of some interest: If we ignore $f_b$ and let
$n = 1$, the convolution of $h$ and $\phi_\epsilon$, when $\epsilon \to 0$, is a step function
such as that shown in Figure 5. The convolution of this function
with $f_b$, for reasonable $f_b$, will serve to smear the steps but it will
not utterly destroy them. Smearing will also result if $n$ is larger
than 1, the amount depending on the value of $n$. Thus, if our assumption
as to the moment is roughly correct, we should expect,
at least for comparatively simple situations, to find the observed
latency distribution somewhat lumpy. Indeed, in the literature (cf.
Woodworth, 1938) it has been remarked not only that the data are
lumpy but that there is an oscillation superimposed on the distribution
curve. This effect could easily be obtained analytically if we
were to assume $h$ uniform over only a small portion of the interval
0 to $\delta$, in other words, if we assume the vast majority of the moment
is truly a refractory period during which there is no intake of
information. These considerations bring out even more strongly the
need for comprehensive experiments to determine the properties of
the moment.

We  shall  not  attempt,  as  before,  to  study  the  parallel  case.  The 
reasons  are  that  the  mathematical  problem  is  rather  complex  and 
with  so  little  information  on  the  nature  of  the  moment  it  hardly 
seems  worthwhile  to  carry  out  the  analysis.  Furthermore,  we  are 
of  the  opinion  that  it  is  unlikely  that  information  accepted  in  dif- 
ferent moments  is  dealt  with  other  than  serially.  It  may  happen, 
however,  that  the  information  accepted  in  one  moment  is  processed 
in  parallel.  The  latter  remark  is  a  possible  hint  for  developing  an 
explanation of the effect of changing the number of "psychological
dimensions" in an information display.


XI. Experimental Proposals. The key assumption in our analysis
is that elementary decision processes can be found of such a sort
that complex decisions can be built up from them in a way which
leaves their characteristic $\lambda$ value invariant. One should like to
present experimental subjects with stimuli which vary in several
dimensions but for which decisions on each of the dimensions have
identical time characteristics. If one uses conceptually different
dimensions, one runs into the difficulty of possibly introducing
several different $\lambda$ values. If we use several objects with the
same dimension relevant for each and with identical characteristics
in every other respect, we have the difficulty that the reception of
the stimulus may not be unitary, but broken down into several
parts separated by receptor orienting acts such as eye movements.
The first of the two following proposals suffers from the latter
difficulty; the second from the former.

1st  Experiment:  Digit  Difference  Perception 

Stimuli:  White  3"  x  5"  cards  with  a  triple-spaced  typed,  horizontal 
row  of  vertically  aligned  pairs  of  digits,  0  and  1,  on  each.  The 
number  of  pairs  per  card  to  vary  from  one  to  sixteen.  On  each  card 
either  one  pair  or  no  pairs  will  be  unlike  digits,  i.e.,  (0,1)  or  (1,0); 
the  remainder  like  pairs,  i.e.,  (1,1)  or  (0,0).  The  place  of  the  un- 
like pair  in  the  series  of  pairs  to  vary  from  the  initial  to  the  final 
position.  Cards  with  the  unlike  pair  in  each  of  the  positions  from 
one  to  n  will  be  included  in  the  set  with  equal  frequency,  and 
cards  with  no  unlike  pair  will  be  included  with  the  same  frequency. 
The  assignment  of  (1,1)  or  (0,0)  to  the  remaining  places  will  be 
made  on  an  equiprobable  random  basis,  and  the  choice  of  (0,1)  or 
(1,0)  for  the  unlike  pair  will  be  made  on  the  same  basis. 
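The construction of the stimulus cards can be sketched in Python. This is only an illustration of the sampling scheme described above; the function names and the representation of a card as a list of digit pairs are hypothetical.

```python
import random


def make_card(n_pairs, unlike_position, rng):
    """Return n_pairs digit pairs; unlike_position is 1-based, or None for no unlike pair."""
    card = []
    for position in range(1, n_pairs + 1):
        if position == unlike_position:
            card.append(rng.choice([(0, 1), (1, 0)]))  # the single unlike pair
        else:
            card.append(rng.choice([(0, 0), (1, 1)]))  # like pairs, equiprobable
    return card


def make_deck(n_pairs, copies, rng):
    """Each unlike position, and the no-unlike-pair case, with equal frequency."""
    positions = list(range(1, n_pairs + 1)) + [None]
    return [
        (position, make_card(n_pairs, position, rng))
        for position in positions
        for _ in range(copies)
    ]


deck = make_deck(4, 2, random.Random(1))
for position, card in deck:
    print(position, card)
```

A deck built this way for $n$ pairs contains $n + 1$ kinds of card in equal numbers, which matches the instruction that an unlike pair in each position, including in no position, be equally likely.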

Responses: Experimenter will announce prior to each stimulus
presentation  how  many  pairs  the  card  to  be  shown  bears.  Subject 
will  respond  yes  or  no,  depending  on  whether  the  card  does  or  does 
not  bear  an  unlike  pair,  by  pressing  the  appropriate  one  of  two 
keys.  The  subject  will  be  told  that  an  unlike  pair  in  each  of  the 
possible  positions,  including  in  no  position,  are  equally  likely 
events,  and  will  be  instructed  to  read  the  lines  of  pairs  from  left 
to  right.  The  data  of  primary  interest  will  be  the  latencies  of  the 
no  response  to  the  cards  which  bear  no  unlike  pair  and  the  la- 
tencies of  the  yes  response  to  the  cards  which  bear  an  unlike  pair 
in  the  nth  position. 

Apparatus: 1. Stimulus cards as described above.

2. Light projector with fast shutter.

3. Three telegraph keys: (a) for the subjects to rest
their fingers on prior to response so that the response
will always start from the same situation,
(b) for yes responses, (c) for no responses.

4.  A  buzzer  of  ^/^  sec  duration  as  a  warning  signal  to 
be  sounded  ending  1  sec  before  shutter  opens  to 
illuminate  stimulus. 

5.  Recording  chronoscope  accurate  to  at  least  ±  10 
millisec. 

6.  Timer  for  ready  signal  and  shutter  operation  with 
silent  starting  key  for  the  experimenter. 

2nd  Experiment:  Multi-attribute  Perception 

Stimuli:  Ten  decks  of  32  cards  each  to  be  prepared  using  two 
values  on  each  of  five  attributes  according  to  the  following  scheme: 

Attribute                    Values

1. Number of spots           2; 3
2. Color of spots            Red; black
3. Shape of spots            Round; square
4. Arrangement of spots      Horizontal line; vertical line
5. Background color          White; green

Responses:  Experimenter  will  announce  what  pattern  of  attributes 
is  to  be  responded  to  positively  prior  to  each  stimulus  presenta- 
tion. Subject  to  make  a  yes  or  no  response  by  pressing  the  appro- 
priate one  of  two  keys  as  exemplified  below: 

Experimenter Says               Stimulus Presented              S to Respond

1. Round red                    Two black squares in            No
                                horizontal line on white card

2. Vertical line of squares     Three red squares in            Yes
   on green card                vertical line on green card

The  instruction-stimulus  pairs  which  call  for  a  negative  response 
should  be  half  of  the  total  number  of  stimuli  presented  in  each 
attribute-pattern  category  so  that  the  uncertainty  of  response  prior 
to  stimulus  presentation  will  be  equalized  at  the  maximum.     The 
data of primary interest will be the latencies of response to the
set-stimulus  pairs  calling  for  a  yes  response. 

Apparatus: Same as for the first experiment except for the stimulus
cards. 
This paper presents a fairly complete review of detection theory as it is applied
to  certain  psychoacoustic  data.  Detection  theory  is  treated  as  a  combination  of  two 
theoretical  structures:  decision  theory  and  the  concept  of  ideal  observer.  The  paper 
discusses  how  statistical  decision  theory  has  been  used  to  analyze  the  auditory  threshold 
process.  By  treating  the  threshold  process  as  an  instance  of  hypothesis  testing,  two 
determinants  of  the  process  are  recognized:  (1)  the  detectability  of  the  signal  and  (2) 
the  criterion  level  of  the  observer.  The  theory  provides  a  technic  of  analysis  which 
allows  one  to  obtain  a  quantitative  estimate  of  both  factors.  The  measure  of  signal  de- 
tectability appears  to  be  independent  of  the  psychophysical  procedure  when  the  physical 
parameters  of  signal  and  noise  are  held  constant.  The  concept  of  ideal  observer  is  re- 
viewed with  special  emphasis  on  the  assumptions  of  the  derivation.  The  usefulness  of 
this  concept  is  illustrated  by  considering  the  shape  of  the  psychophysical  function — the 
function  relating  the  detectability  of  the  signal  to  its  intensity.  A  rather  general  model 
based  on  the  concept  of  signal  uncertainty  is  presented  which  attempts  to  explain  this 
relationship. 

Introduction 

There  are  two  very  striking  characteristics  of  the  field  of  psychoacoustics.  One 
is  the  breadth  and  variety  of  research  skills  and  techniques  used  to  study  hearing.  The 
techniques  range  from  hydrodynamic  studies  of  the  cochlea  to  analysis  of  the  percep- 
tion of  vowel  forms.  This  multidisciplinary  approach  is  a  fortunate  one  since  it  reduces 
the  chances  that  any  really  significant  aspect  of  the  sensory  system  is  being  overlooked. 
However,  it  creates  a  diversity  which  makes  integration  of  these  areas  most  difficult. 

A  second  characteristic  of  the  field  is  the  lack  of  any  integrative  structure  from 
which  to  view  the  rapidly  expanding  experimental  literature.  If  some  basic  theoretical 
structure  existed,  these  new  data  might  easily  be  integrated  with  the  old.  Psycho- 
acoustics,  however,  does  not  have  any  complete  comprehensive  theory.  A  reflection  of 
this  deficit  is  the  lack  of  consensus  on  methodology.  Often,  even  where  a  general 
consensus  seems  to  exist  in  some  area  of  the  field,  a  new  paper  may  force  a  complete 
re-examination  of  the  entire  measurement  procedure.    A  recent  example  of  the  latter 
may be found in the exchanges of Garner¹ and Stevens² on the quantitative scale of
loudness.   Such  a  situation  compounds  the  problem  of  integration. 

This  paper,  therefore,  makes  no  attempt  at  broad  coverage.  The  author  hopes 
that  by  concentrating  on  one  rather  limited  topic  some  positive  contribution  can  be 
made.  This  topic  is  the  detection  of  signals  in  noise.  In  recent  years  a  general  theore- 
tical structure  (detection  theory)  has  been  used  to  analyze  such  experiments.  Un- 
fortunately, there  appears  to  be  some  confusion  both  about  the  theory  itself  and  the 
manner  of  its  application.  The  main  objective  of  this  paper  will  be  to  clarify  these  two 
questions. Part of the confusion about the theory arises from the fact that detection
theory  is  a  combination  of  two  distinct  theoretical  structures:  decision  theory  and  the 
theory  of  ideal  observers.  Before  we  begin  a  detailed  discussion  of  these  two  aspects 
of  detection  theory,  we  will  briefly  outline  them  and  relate  them  to  psychoacoustic 
problems. 

Decision  theory  provides  an  analysis  of  the  process  which  generates  the  dicho- 
tomy between  stimuli  the  subject  reports  he  does  and  does  not  hear.  The  theory 
recognizes  that  a  priori  probabilities,  values,  and  costs  of  correct  and  incorrect  decisions, 
as  well  as  the  physical  parameters  of  the  signal,  play  a  decisive  role  in  establishing  this 
dichotomy.  We  will  find  that  this  dichotomy  is  determined  by  an  adjustable  criterion. 
The  theory  shows  how  a  quantitative  estimate  of  the  criterion  can  be  obtained  from  the 
data. 

There  are  many  psychoacousticians  whose  only  interest  in  this  criterion  is  as  a 
constant  parameter  from  which  to  obtain  substantive  relations  between  two  physical 
parameters,  for  example,  the  absolute  threshold  energy  as  a  function  of  frequency,  or 
the just detectable change in power as a function of power (ΔI vs I). To them this
aspect  of  detection  theory  will  be  of  methodological  interest  only.  Yet  clearly,  if 
factors  such  as  a  priori  probability,  values,  and  costs  do  play  a  role  in  determining  the 
threshold,  their  control  in  substantive  experiments  is  imperative. 

The  second  part  of  detection  theory  is  more  directly  related  to  substantive 
matters — it  is  the  theory  of  ideal  observers.  Briefly,  the  theory  provides  a  collection  of 
ideal  mathematical  models  which  relates  the  detectability  of  the  signal  to  definite 
physical  characteristics  of  the  stimulus.  There  is  a  collection  of  such  models  because 
one  may  make  different  restrictions  on  the  nature  of  the  detection  device.  These 
theoretical  observers  are  rarely  used  as  actual  models  of  the  hearing  mechanism. 
Most  often,  they  are  used  for  the  sake  of  comparing  human  performance  with  that  of 
the  ideal  observer  in  order  to  specify  the  nature  and  amount  of  discrepancy.  This 
comparison,  in  turn,  suggests  either  a  new  and  hopefully  more  accurate  representation 
of  the  hearing  mechanism,  or  new  experiments  to  clarify  further  the  exact  nature  of  the 
discrepancy.   This  will  be  illustrated  in  a  later  section  of  the  paper. 

Decision  Theory 

We  shall  demonstrate,  under  quite  general  assumptions,  how  a  transformation  of 
the  subject's  responses  can  be  utilized  to  determine  both  the  subject's  criterion  and  the 
detectability  of  the  signal.  This  analysis  requires  an  understanding  of  several  basic 
concepts which are rather complex. We might skip over these fundamentals and start,
as some previous expositions have, with some assumptions about Gaussian distributions
and parameters of these distributions. Such a procedure would be unfortunate because
it robs the analysis of its generality and implies that strong assumptions are needed to
justify its applicability. Such is not the case.

¹ W. R. Garner, J. Acoust. Soc. Am., 1958, 30, 1005.
² S. S. Stevens, J. Acoust. Soc. Am., 1959, 31, 995.

Typically,  psychoacousticians  try  to  analyze  the  subject's  responses  by  making 
some  assumptions  about  the  way  in  which  the  sound  is  processed  by  the  hearing 
mechanism.  One  assumes,  for  example,  that  the  cochlea  either  makes  a  frequency  analy- 
sis of  the  waveform  or  that  it  does  not,  etc.  We  wish  to  postpone  temporarily  such 
substantive  issues.  Let  us,  for  the  present,  merely  assume  that  each  sound  may  be 
represented  by  a  series  of  numbers.  These  numbers  might  be  the  values  of  a  series  of 
attributes,  or  various  states  of  the  nervous  system.  Whatever  the  representation,  let 
us  call  this  abstraction  an  observation. 

The  problem  we  wish  to  consider  is  this:  Given  an  observation,  what  response 
alternative  should  be  chosen?  What  is  a  good  choice  and  how  can  we  analyze  these 
choices?  We  shall  attempt  to  answer  these  questions  by  considering  a  single  example. 
The example is obviously specific; the generality rests in the concepts. The single
motive in presenting this example is to enable us to discuss these concepts (likelihood
ratio, decision rule, and criterion) with some precision and yet avoid formalism.³
After  this  theoretical  discussion,  we  shall  investigate  the  applicability  of  these  concepts 
to  a  psychoacoustic  experiment. 

An  example  of  decision  theory 

Let us assume we have 10 observations, each observation ($X_j$) represented by
three numbers [$X_j = (x_1, x_2, x_3)$], and that we have two hypotheses, $H_1$, $H_2$, about the
observations. Given an observation, we wish to decide whether the observation is an
instance of $H_1$ or $H_2$.⁶ We shall assume we have complete information about the
probability of each observation given each hypothesis.

By limiting the example to 10 observations we can work with probabilities
directly. The reader should note that the three numbers ($x_1$, $x_2$, $x_3$) could have been
extended to three hundred. Everything that follows is independent of the dimensionality
of the observation. The variables ($x$) of the observation could be quantitative
(integers or real numbers) or qualitative (red, blue, or green). They are simply
descriptions of the observation.

Likelihood  ratio.  In  Table  I,  we  have  listed  the  observations  and  the  three 
numbers  corresponding  to  each  observation.   The  next  two  columns  provide  the  data 

³ These concepts come from the topic of statistical decision theory and the theory of
inference. Most of the key theorems were first presented by Wald, who extended the basic
principle which originated with Neyman and Pearson.

⁴ A. Wald, Statistical decision functions. New York: Wiley, 1950.

⁵ J. Neyman and E. S. Pearson, Phil. Trans. Roy. Soc. London, 1933, A231, 289.

⁶ For a concrete interpretation of the example, the reader might think of the observation
as a sealed package, the three numbers as the length, width, and depth of the package, and the
hypothesis as whether the package contains a toy car or animal. The problem, then, is this:
Given the measurements of a package, guess whether it contains a car or an animal. Alternatively,
one might think of the observation as a sound which can be specified by three numbers
or attributes. The problem is: Decide from the three numbers whether the sound is a
consonant or a vowel.
TABLE I

Description of the Observations (X_j) and the Probability of Obtaining
That Observation Given Either Hypothesis (H_1 or H_2).

                                                                      l(x_1, x_2, x_3)
Observation   x_1   x_2   x_3   P_H1(x_1, x_2, x_3)   P_H2(x_1, x_2, x_3)   = P_H1 / P_H2

X_1            4     3     3          0.14                  0.01                14.00
X_2            3     3     5          0.01                  0.01                 1.00
X_3            2     2     4          0.03                  0.30                 0.10
X_4            3     3     3          0.30                  0.10                 3.00
X_5            2     3     3          0.02                  0.04                 0.50
X_6            5     2     2          0.09                  0.01                 9.00
X_7            2     5     5          0.10                  0.08                 1.25
X_8            3     4     5          0.20                  0.05                 4.00
X_9            3     2     5          0.06                  0.30                 0.20
X_10           4     2     5          0.05                  0.10                 0.50

Total                                 1.00                  1.00

on the probabilities of each observation on each hypothesis. The final column is simply
the ratio of the fifth column to the sixth and represents the likelihood ratio. The
likelihood ratio, then, is the probability that a particular observation resulted from
$H_1$ divided by the probability that it resulted from $H_2$. The likelihood ratio gives what
some call the "odds." If we have ($X_6$) we should be willing to wager nine cents to one
that $H_1$ is correct. Note that the likelihood ratio is a number, not a probability, and
that this number is a function of three variables $(x_1, x_2, x_3)$. Thus we have taken an
observation which is specified by three values $(x_1, x_2, x_3)$, and related it to a single
variable $l(x_1, x_2, x_3)$.
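
As a sketch of the computation just described, the final column of Table I can be
reproduced in a few lines of Python. The probabilities are those of Table I; the names
`TABLE_I` and `likelihood_ratio` are introduced here only for illustration.

```python
# Table I: the three descriptive numbers, then P(X | H1) and P(X | H2).
TABLE_I = [
    ((4, 3, 3), 0.14, 0.01),
    ((3, 3, 5), 0.01, 0.01),
    ((2, 2, 4), 0.03, 0.30),
    ((3, 3, 3), 0.30, 0.10),
    ((2, 3, 3), 0.02, 0.04),
    ((5, 2, 2), 0.09, 0.01),
    ((2, 5, 5), 0.10, 0.08),
    ((3, 4, 5), 0.20, 0.05),
    ((3, 2, 5), 0.06, 0.30),
    ((4, 2, 5), 0.05, 0.10),
]


def likelihood_ratio(p_h1, p_h2):
    """l(X) = P(X | H1) / P(X | H2)."""
    return p_h1 / p_h2


# Round to two decimals, the precision used in the table.
ratios = [round(likelihood_ratio(p1, p2), 2) for _, p1, p2 in TABLE_I]

for i, ((x, _, _), l) in enumerate(zip(TABLE_I, ratios), start=1):
    print(f"X{i} = {x}: l(X) = {l:.2f}")
```

Whatever the number of descriptive variables, each observation is reduced to one
number; the tuple of three values plays no further part once the ratio is formed.
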

The  reason  we  have  performed  this  transformation  is  simply  stated:  We  can 
make  optimum  decisions  if  we  use  the  likelihood  ratio.  We  have  not  stated  what  we 
mean  by  optimum,  but  let  us  take  up  this  point  a  little  later.  First,  let  us  show  how  we 
might  use  the  likelihood  ratio  in  making  decisions. 

Decision rule. If someone asks us to make a decision about a particular
observation, whether it is an instance of $H_1$ or $H_2$, we would probably guess it was $H_1$
if the probability of that observation was greater on $H_1$ than on $H_2$. Such a statement
is called a decision rule. In terms of likelihood ratio this decision can be expressed as
follows: Choose $H_1$ if $l(X) > 1$. In effect, we have specified our decision rule by
choosing one number; in this case, the number "one." This number is called a criterion
or, more precisely, a likelihood-ratio criterion.

Suppose that, independent of any specific observation, $H_2$ was ten times as
likely as $H_1$. Clearly, we would not maintain our previous criterion; even without
knowing the characteristics of the observation, the odds are ten to one in favor of $H_2$.
It turns out in this case that we should choose $H_1$ only if $l(X) > 10$. That is, we should
choose $H_1$ only if, in our example, the specific observation is $X = (4, 3, 3)$.
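
The effect of the a priori odds on the criterion can be sketched as follows. The sketch
assumes the objective is to maximize the proportion of correct decisions, in which case
the criterion on likelihood ratio equals the prior odds $P(H_2)/P(H_1)$; the function name
`chosen_as_h1` is hypothetical. Probabilities are held in hundredths so that every
comparison is exact.

```python
# P(X | H1) and P(X | H2) for X1..X10 of Table I, in hundredths.
P_H1 = [14, 1, 3, 30, 2, 9, 10, 20, 6, 5]
P_H2 = [1, 1, 30, 10, 4, 1, 8, 5, 30, 10]


def chosen_as_h1(prior_odds_h2_to_h1):
    """Return the 1-based indices of the observations for which H1 is chosen.

    H1 is chosen when l(X) = P_H1 / P_H2 exceeds the criterion, taken here
    to be the prior odds P(H2) / P(H1). The inequality is cross-multiplied
    to avoid floating-point division.
    """
    return [
        i + 1
        for i, (p1, p2) in enumerate(zip(P_H1, P_H2))
        if p1 > prior_odds_h2_to_h1 * p2
    ]


print(chosen_as_h1(1))   # both hypotheses equally likely a priori
print(chosen_as_h1(10))  # H2 ten times as likely as H1
```

With equal priors five of the ten observations lead to the response $H_1$; with odds of
ten to one against $H_1$ only the observation (4, 3, 3) does.
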

Similarly, if we place asymmetrical values and costs on the various correct
and incorrect decisions, we should change our criterion on likelihood ratio accordingly.

Monotonic functions of likelihood ratio. While we can state our decision procedure
in terms of likelihood ratio, there are other exactly equivalent ways of stating
the decision rules. In the example, it so happens that the product $x_1$ times $x_2$ minus $x_3$
is also an optimum decision quantity. This is true because this quantity is monotonic
with the likelihood ratio. The criterion number is not the same as that we would use
on a likelihood-ratio scale, but there is always some number on this monotonic scale
which corresponds to the criterion number on likelihood ratio. For example, suppose
we select the alternative $H_1$ if $l(x_1, x_2, x_3) > 1.25$; then we would make identical
decisions using the decision rule, select $H_1$ if $(x_1 \cdot x_2 - x_3) > 5.00$.
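
The claim that $x_1 x_2 - x_3$ orders the observations exactly as likelihood ratio does
can be checked directly against Table I. The following is a minimal sketch; the helper
names `quantity` and `sign` are introduced only for this illustration.

```python
from itertools import combinations

# Table I: (x1, x2, x3), then P(X | H1) and P(X | H2) in hundredths.
TABLE_I = [
    ((4, 3, 3), 14, 1), ((3, 3, 5), 1, 1), ((2, 2, 4), 3, 30),
    ((3, 3, 3), 30, 10), ((2, 3, 3), 2, 4), ((5, 2, 2), 9, 1),
    ((2, 5, 5), 10, 8), ((3, 4, 5), 20, 5), ((3, 2, 5), 6, 30),
    ((4, 2, 5), 5, 10),
]


def quantity(x):
    """The alternative decision quantity x1 * x2 - x3."""
    x1, x2, x3 = x
    return x1 * x2 - x3


def sign(v):
    return (v > 0) - (v < 0)


# Monotonic: every pair of observations is ordered the same way by l(X)
# (compared by cross-multiplication) and by the alternative quantity.
monotonic = all(
    sign(p1a * p2b - p1b * p2a) == sign(quantity(xa) - quantity(xb))
    for (xa, p1a, p2a), (xb, p1b, p2b) in combinations(TABLE_I, 2)
)

by_ratio = [4 * p1 > 5 * p2 for _, p1, p2 in TABLE_I]   # l(X) > 1.25
by_quantity = [quantity(x) > 5 for x, _, _ in TABLE_I]  # x1*x2 - x3 > 5.00

print(monotonic, by_ratio == by_quantity)
```

Both checks succeed for the ten observations of the example, so the two rules select
$H_1$ for exactly the same observations.
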

In  many  cases,  such  as  the  application  of  this  theory  to  psychoacoustics,  the 
decision  axis  is  unobservable,  and  hence  we  are  only  interested  in  equivalent  decision 
procedures.  To  say  the  observer  uses  an  optimum  decision  procedure  means  only  that 
he  is  using  a  monotonic  transformation  of  likelihood  ratio. 

Optimum nature of likelihood ratio. We turn now to the very important question
of the optimum nature of likelihood ratio. Clearly a decision procedure based on
likelihood ratio is only optimum if it best attains some specific objective. Let us list
some of these objectives to indicate their generality: (1) maximize the expected value
of decisions,^ (2) minimize risk,^ (3) estimate a posteriori probability,^ (4) maximize the
percentage of correct decisions,^ and (5) set the error rate on some decision alternative
at some constant and maximize the number of correct decisions for the other
alternative.^ The impressive fact is that a decision criterion based on likelihood ratio is
optimum under all the above objectives. Naturally this criterion may be different for
different objectives. The references listed with the objectives contain a more detailed
explanation of each objective and prove how a decision rule based on likelihood ratio,
or some monotonic transformation of that quantity, may be used to make the best
decisions.¹⁰

Distribution of likelihood ratio. We have seen how each observation, independent
of the number of attributes included in the observation, can be reduced to a single
quantity: likelihood ratio. Likelihood ratio is simply a function of several variables
and for any single observation is simply a number. We may then properly consider a
probability defined on the variable likelihood ratio. Let us consider, in particular, the
probability that we shall obtain a particular value of likelihood ratio under $H_1$ and $H_2$
of the preceding example. Table II shows these probabilities and the corresponding
cumulative distributions for both hypotheses of our example. The likelihood ratio is
ranked from largest to smallest to facilitate the explanation of the ROC curve.¹¹

ROC curves and their properties. We shall use Table II to construct an ROC
(Receiver Operating Characteristic) curve. To do this, let us assume the decision rule
is to accept $H_1$ if $l(x_1, x_2, x_3) \geq k$. If $k = 14$ we find that the probability of accepting

⁷ W. W. Peterson, T. G. Birdsall, and W. C. Fox, Trans. IRE, 1954, PGIT-4, 171.

⁸ T. W. Anderson, An introduction to multivariate statistical analysis, New York:
Wiley, 1958.

⁹ P. M. Woodward, Probability and information theory with applications to radar.
New York: McGraw-Hill, 1955.

¹⁰ To estimate a posteriori probability no criterion is involved. In this case the best
estimate of a posteriori probability is a simple monotonic transformation of likelihood ratio.

¹¹ Note that since two observations yield a likelihood ratio of 0.50, we have added the
probabilities under both hypotheses to obtain the probability of that likelihood ratio.

TABLE II

Probability under Each Hypothesis that $l(X)$ Will Have a Certain Value.

$l(X)$    $P_{H_1}[l(X)]$   Cumulative   $P_{H_2}[l(X)]$   Cumulative
14.00          0.14            0.14           0.01            0.01
 9.00          0.09            0.23           0.01            0.02
 4.00          0.20            0.43           0.05            0.07
 3.00          0.30            0.73           0.10            0.17
 1.25          0.10            0.83           0.08            0.25
 1.00          0.01            0.84           0.01            0.26
 0.50          0.07            0.91           0.14            0.40
 0.20          0.06            0.97           0.30            0.70
 0.10          0.03            1.00           0.30            1.00

$H_1$ when it is true [$P_{H_1}(H_1)$] is 0.14 and the probability of accepting $H_1$ when it is false
[$P_{H_2}(H_1)$] is 0.01. By decreasing $k$, we change both probabilities. The upper curve
shown in Fig. 1 shows how the probabilities change as a function of $k$, and is called an
ROC curve. The two probabilities completely represent the stimulus-response matrix
in a two-alternative detection task since the complements of $P_{H_1}(H_1)$ and $P_{H_2}(H_1)$ are
the two remaining cells in the stimulus-response matrix.
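
The construction of the ROC points from the example can be sketched as follows. The
sketch pools observations that share a likelihood ratio, ranks the ratios from largest
to smallest, and accumulates the probabilities under each hypothesis, as in Table II.
The function name `roc_points` is an assumption made for this illustration.

```python
from fractions import Fraction
from itertools import accumulate

# P(X | H1) and P(X | H2) for the ten observations of Table I, in hundredths.
P_H1 = [14, 1, 3, 30, 2, 9, 10, 20, 6, 5]
P_H2 = [1, 1, 30, 10, 4, 1, 8, 5, 30, 10]


def roc_points(p_h1, p_h2):
    """Return (k, P_H2(H1), P_H1(H1)) for the rule "accept H1 if l(X) >= k"."""
    # Pool observations that share a likelihood ratio (here X5 and X10).
    pooled = {}
    for p1, p2 in zip(p_h1, p_h2):
        k = Fraction(p1, p2)
        hit, false_alarm = pooled.get(k, (0, 0))
        pooled[k] = (hit + p1, false_alarm + p2)
    ks = sorted(pooled, reverse=True)  # largest likelihood ratio first
    hits = accumulate(pooled[k][0] for k in ks)
    false_alarms = accumulate(pooled[k][1] for k in ks)
    return [(float(k), fa / 100, h / 100)
            for k, fa, h in zip(ks, false_alarms, hits)]


for k, fa, hit in roc_points(P_H1, P_H2):
    print(f"k = {k:5.2f}   P_H2(H1) = {fa:.2f}   P_H1(H1) = {hit:.2f}")
```

Each printed row is one point of the upper curve: the abscissa is the probability of
accepting $H_1$ when it is false and the ordinate the probability of accepting it when it
is true.
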

Figure 1
The receiver operating characteristic (ROC) curve of the example. The axes are $P_{H_1}(H_1)$, which
is the probability of responding $H_1$ if the observation was from $H_1$, and $P_{H_2}(H_1)$, which is the
probability of responding $H_1$ if the observation was from $H_2$. The points were plotted from
Table II.

What if some decision procedure which is less than optimum were used? Let
us consider an extremely poor decision procedure. The lower curve of the figure was
generated by using the decision rule accepting $H_1$ if $l(x_1, x_2, x_3) < k$ for all $k$. This is
the exact opposite of the first decision rule and hence generates the ROC curve for the
worst possible decision rule.

The area included between the upper and lower bounds on performance represents
attainable performance using any decision procedure in this task. Obviously
any single decision is either right or wrong, but any decision rule whatever, in the long
run, will produce some probability of "hit" and some probability of "miss" which lie
within the bounds illustrated.¹² Other decision procedures do not necessarily involve
likelihood ratio. One procedure would be to flip a coin and select the first alternative
if the coin landed heads; if the coin were unbiased, this decision rule would achieve
an error and hit rate of 0.5. Should the coin be biased, this decision procedure would
produce performance located somewhere along the center diagonal of Fig. 1.
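
The operating points reached by randomized procedures can be sketched in a few lines.
The sketch assumes only that a procedure which uses one rule with some probability and
another rule otherwise has, in the long run, the correspondingly weighted average of
their two operating points. The names `mix` and `coin_rule` are hypothetical; the two
ROC points are those of the example for the criteria $l(X) > 3$ and $l(X) \geq 3$.

```python
# Operating points (P_H2(H1), P_H1(H1)) of the example, from Table II.
STRICT = (0.07, 0.43)  # accept H1 if l(X) > 3
LAX = (0.17, 0.73)     # accept H1 if l(X) >= 3


def mix(point_a, point_b, weight_a):
    """Operating point of a procedure that uses rule A with probability
    weight_a and rule B otherwise."""
    return tuple(weight_a * a + (1 - weight_a) * b
                 for a, b in zip(point_a, point_b))


def coin_rule(p_heads):
    """Ignore the observation and accept H1 with probability p_heads.

    The response is independent of the hypothesis, so the hit rate and
    the false-alarm rate are both p_heads: a point on the diagonal.
    """
    return (p_heads, p_heads)


print(mix(STRICT, LAX, 0.5))  # midway between the two ROC points
print(coin_rule(0.5))         # unbiased coin
print(coin_rule(0.8))         # biased coin, still on the diagonal
```

Randomizing between adjacent criteria fills in the straight segments of the ROC curve,
while a coin that ignores the observation can only move along the diagonal.
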

Another  point  to  be  noted  involves  the  slope  of  the  ROC  curve  based  on  the 
optimum  decision  axis.  Notice  that  the  slope  between  any  two  consecutive  points  is 
equal  to  the  likelihood  ratio  of  the  higher  point.  Thus  the  slope  must  clearly  diminish 
because  each  successive  point  represents  a  lower  value  for  likelihood  ratio.  Any  ROC 
curve  which  does  not  show  a  monotonically  decreasing  slope  implies  an  incorrect 
decision  rule.  This  means  that  the  decision  maker  is  accepting  the  first  hypothesis  when 
the  likelihood  exceeds  a  certain  value  and  yet  accepting  the  other  hypothesis  when 
likelihood  ratio  is  some  greater  value.  Any  such  inversion  in  slope  for  any  ROC  curve 
implies  that  better  performance  might  be  achieved  by  interchanging  some  of  the  points 
on  the  decision  axis. 
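
The slope property can be verified numerically for the example. The sketch below takes
the cumulative columns of Table II as the ROC points, adds the origin, and computes the
slope of each segment exactly; `segment_slopes` is a name introduced for illustration.

```python
from fractions import Fraction

# ROC points of the optimum rule (Table II), in hundredths, from the origin.
FALSE_ALARM = [0, 1, 2, 7, 17, 25, 26, 40, 70, 100]  # P_H2(H1)
HIT = [0, 14, 23, 43, 73, 83, 84, 91, 97, 100]       # P_H1(H1)


def segment_slopes(false_alarm, hit):
    """Slope of the line joining each pair of consecutive ROC points."""
    points = list(zip(false_alarm, hit))
    return [Fraction(h1 - h0, f1 - f0)
            for (f0, h0), (f1, h1) in zip(points, points[1:])]


slopes = segment_slopes(FALSE_ALARM, HIT)
print([float(s) for s in slopes])
print(all(a > b for a, b in zip(slopes, slopes[1:])))  # strictly decreasing
```

The slopes reproduce the likelihood-ratio column of Table II in order, which is why a
curve generated by the optimum rule cannot show an increase in slope.
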

ROC curve and percent correct using forced choice. The ROC curve is useful in
a situation where the subject's response is limited to selecting one or the other
alternative. There are other ways in which the detection task may be structured; one
involves the class of forced-choice procedures. For simplicity, we will consider a
two-alternative forced-choice task. The extension to larger numbers of alternatives should
be clear from the following discussion. A two-alternative forced-choice procedure is
one in which two stimuli are presented, one from each class, and the subject is asked,
in effect, what was the order of the stimuli: $H_1 H_2$ or $H_2 H_1$?

We shall calculate the probability of a correct decision based on the following
rule: Select the alternative $H_1 H_2$ if the likelihood ratio on the first observation is
greater than on the second. In effect, this rule says to pick the larger likelihood ratio
and say $H_1$ for that observation. The reason for considering only this particular
decision rule is that this assumption is often made in the analysis of forced-choice tests.¹³

Assuming  the  subject  picks  the  larger  of  two  likelihood  ratios  and  says  the 

¹² It should also be noted that the lines connecting the points in the ROC curve do in
fact represent attainable performance. For example, a point located midway between the
points (7, 43) and (17, 73) is attainable by using a mixed-decision procedure, where $H_1$ is
accepted if $l(X) > 3$, each alternative is selected half the time by some random procedure
if $l(X) = 3$, and $H_2$ is selected if $l(X) < 3$.

¹³ Were we to give a complete analysis of this situation we would first list all possible
stimulus pairs $(S_i, S_j)$. Next we would consider the probabilities on the hypothesis that the pairs
represented instances of $H_1 H_2$ or $H_2 H_1$, compute a likelihood ratio, and, in fact, derive an ROC
curve based on these computations.

TABLE III

Calculation of the Probability of a Correct Response
in a Forced-Choice Test.

$k$      $P_{H_1}[l_1(X) = k]$   $P_{H_2}[l_2(X) < k]$   Product
14              0.14                    0.99             0.1386
 9              0.09                    0.98             0.0882
 4              0.20                    0.93             0.1860
 3              0.30                    0.83             0.2490
 1.25           0.10                    0.75             0.0750
 1.00           0.01                    0.74             0.0074
 0.50           0.07                    0.60             0.0420
 0.20           0.06                    0.30             0.0180
 0.10           0.03                    0.00             0.0000
Sum                                                      0.8042

likelihood ratio was produced by $H_1$, we shall be correct if the larger likelihood was in
fact produced by $H_1$ and the smaller was in fact produced by $H_2$. The probability of
this occurrence is $P_{H_1}[l_1(X)] \cdot P_{H_2}[l_2(X)]$ where $l_1(X) > l_2(X)$. In fact, if the larger
likelihood ratio is equal to $k$, the probability of a correct choice is simply:
$P_{H_1}[l_1(X) = k] \cdot \{P_{H_2}[l_2(X) < k]\}$.¹⁴ To obtain the final result we need only sum over all
the values of $k$, since any of these values might be the largest, except the lowest value
of likelihood ratio itself.

Table III gives these calculations and the final answer (0.8042). While the
method of calculating this probability is straightforward, often, especially in
psychoacoustic experiments, one does not have numerical distributions on a likelihood-ratio
scale. Two approaches could be used in these situations. The first, and the safest,
since it makes no additional assumptions, would be to compute the probability from
an experimentally determined ROC curve. If you look at Table III closely, you will
see that the quantities used in the calculation are simply $\Delta P_{H_1}(H_1)$ times $[1 - P_{H_2}(H_1)]$
for each successive point on the ROC curve (Fig. 1). Obviously, the accuracy of such a
procedure is heavily determined by the accuracy of the experimental estimate of the
ROC curve. The merit of the technique is that no assumptions beyond that of the
decision rule are necessary to predict forced-choice behavior from the ROC data.
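
The calculation of Table III from the ROC points alone can be sketched as follows. The
function name and the `alternatives` parameter are assumptions of this sketch; for more
than two alternatives it raises the second factor to the power $M - 1$, which presumes
that the $M - 1$ observations from $H_2$ are independent. As in Table III, a tie between
the likelihood ratios is not counted as a correct response.

```python
# Cumulative ROC coordinates of the example (Table II), in hundredths.
HIT = [14, 23, 43, 73, 83, 84, 91, 97, 100]        # P_H1(H1)
FALSE_ALARM = [1, 2, 7, 17, 25, 26, 40, 70, 100]   # P_H2(H1)


def forced_choice_correct(hit, false_alarm, alternatives=2):
    """Sum over ROC points of (increment in hit rate) times
    (1 - false-alarm rate) raised to the power (alternatives - 1)."""
    total = 0.0
    previous_hit = 0
    for h, fa in zip(hit, false_alarm):
        delta_hit = (h - previous_hit) / 100  # P_H1[l = k]
        below = (100 - fa) / 100              # P_H2[l < k]
        total += delta_hit * below ** (alternatives - 1)
        previous_hit = h
    return total


print(round(forced_choice_correct(HIT, FALSE_ALARM), 4))
print(round(forced_choice_correct(HIT, FALSE_ALARM, alternatives=4), 4))
```

With two alternatives the sum reproduces the entry at the foot of Table III; the only
inputs are the coordinates of the ROC points.
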

A second procedure, one which has often been used, is to make some assumptions
about the distributions which generated the ROC curve and then use these assumptions
in predicting behavior in the forced-choice experiment. The most popular set of
assumptions is that the distribution of observations on the likelihood-ratio axis, or
some monotonic function of that axis, is normal or Gaussian under both hypotheses.
The distributions are assumed to differ only in their means and, sometimes, in their
standard deviations. If we assume, for simplicity, that the standard deviations are equal
under both hypotheses, then the ROC curve can be characterized by one parameter:

¹⁴ If more than two, say $M$, alternatives are used in the forced-choice test, the equation
becomes

$$P(\text{correct}) = \sum_k P_{H_1}(l_1 = k) \prod_{i=2}^{M} P_{H_2}(l_i < k).$$

the difference in the means divided by the standard deviation ($\Delta M/\sigma$). This parameter
is usually denoted by $d' = \Delta M/\sigma$. The calculations of the probability of a correct
detection in a two-alternative forced-choice situation if these assumptions are made are
quite simple. The probability that one likelihood is larger than another is the
probability that the difference is greater than zero. Since, by assumption, some
transformation of $l(X)$ is normal, the difference distribution is normal with a mean of $\Delta M$ and a
variance equal to the sum of the original variances. Hence the probability of a correct
decision is

$$P(\text{correct, 2 alternative}) = \Phi\left[\Delta M/(\sigma_1^2 + \sigma_2^2)^{1/2}\right] = \Phi\left[d'/(2)^{1/2}\right].$$

The  probability  of  being  correct  for  any  number  of  alternatives  is  given  in  footnote 
reference  15. 
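
Under the equal-variance Gaussian assumption the two-alternative prediction is a
one-line computation. The sketch below uses the standard normal distribution function
from the Python standard library; the function name `percent_correct_2afc` and the
sample values of $d'$ are chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def percent_correct_2afc(d_prime):
    """P(correct) = Phi(d' / sqrt(2)) for two-alternative forced choice,
    assuming normal distributions with equal variance."""
    return NormalDist().cdf(d_prime / sqrt(2))


for d_prime in (0.0, 0.5, 1.0, 2.0):
    print(f"d' = {d_prime:.1f}   P(correct) = {percent_correct_2afc(d_prime):.3f}")
```

When $d' = 0$ the two distributions coincide and the prediction is chance performance,
0.5; the predicted proportion correct rises steadily with $d'$.
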

We have now reviewed all the essential aspects of how detection theory uses
decision theory in analyzing the process of detection. Let us now turn to some
experimental results and see to what extent these notions are supported. Following this
review of the experimental studies, we shall conclude this section with a discussion of
the implications of these studies for psychoacoustic procedures in general.

Experimental  results 

ROC curve. One of the earlier studies¹⁶ simply sought to determine
experimentally the shape of the ROC curve in a simple psychoacoustic task. The signal was
a 1/10 second of a 1000-cps sinusoid. White noise, the masking stimulus, was present
continuously throughout the experimental session. A light occurred to mark the
observation interval. During this interval either the signal was added to the noise (SN)
or simply the noise was presented (N): these were the two hypotheses of the detection
task. The subject gave one of two possible responses; he pressed one button if he
believed the signal was present ("yes") or pressed a second button if he believed no
signal was present ("no"). The physical parameters of the situation, including noise
and signal levels, were held constant. The independent variable was the probability
(a priori) of a signal being present. Five levels of a priori probability were selected
(0.1, 0.3, 0.5, 0.7, 0.9) and the one used for a given session of 300 observations was
announced to the subject. After the subject responded, he was given immediate
information as to whether or not the signal had in fact been presented. The subject
was awarded some fraction of a cent for each correct answer and fined an equal amount
for each incorrect answer. He was instructed to make as much money as possible.

The results for one of the subjects are presented in Fig. 2. [$P_N(A)$ is the
probability of saying "yes" when noise alone was presented.] The general trend of the data
supports the decision-theory analysis. The curve drawn is generated by assuming the
distributions on likelihood ratio are normal under both hypotheses. The normalized
difference between the means is 0.92.
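
A theoretical curve of this kind can be sketched as follows, assuming equal variances
under both hypotheses and taking the normalized difference between the means, 0.92, as
$d'$. The function name `hit_rate` and the sampled false-alarm rates are illustrative
choices, not values taken from the experiment.

```python
from statistics import NormalDist

STANDARD = NormalDist()  # zero mean, unit standard deviation


def hit_rate(false_alarm_rate, d_prime):
    """P_SN(A) on the equal-variance Gaussian ROC curve for a given P_N(A).

    Noise alone is taken as N(0, 1) and signal-plus-noise as N(d', 1) on
    the decision axis; "yes" is given when the observation exceeds the
    cut point.
    """
    cut_point = STANDARD.inv_cdf(1 - false_alarm_rate)
    return 1 - STANDARD.cdf(cut_point - d_prime)


for p_n in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"P_N(A) = {p_n:.1f}   P_SN(A) = {hit_rate(p_n, 0.92):.3f}")
```

The resulting curve bows above the diagonal and is symmetric about the negative
diagonal, the shape expected whenever the two variances are equal.
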

Threshold  model  and  the  ROC  curve.  Before  considering  whether  or  not  the 
subjects  adopted  the  proper  criterion  so  as  actually  to  maximize  their  payoff,  let  us 
consider  one  alternative  explanation  of  the  data.  This  is  the  so-called  threshold  model. 

¹⁵ P. B. Elliott, Electronic Defense Group, University of Michigan, Technical Report
No. 97, 1959.

¹⁶ W. P. Tanner, J. A. Swets, and D. M. Green, Electronic Defense Group, University of
Michigan, Technical Report No. 30, 1956.

Figure 2
A sample of the ROC curve from an auditory detection experiment. See footnote 16. $P_N(A)$
is the probability of responding "yes" when noise alone was presented. $P_{SN}(A)$ is the
probability of saying "yes" when signal-plus-noise was presented. These probabilities were estimated
from the stimulus-response matrix. See text for details of the experiment.

The essentials of this model are that the signal, when added to the noise, augments some
process within the organism, such that if the increment reaches a critical level called
the threshold, the signal is heard and can be correctly detected. So far, we note no
great difference with the decision-theory analysis except in semantics. If one calls the
decision-theory criterion a threshold and the hypothetical process likelihood ratio,
the correspondence is complete. The differences between the models appear when one
considers "subthreshold" events and the procedures used to deal with these events.
The threshold model assumes that should the signal increment fail to reach the threshold,
the subject can only make a pure guess as to whether or not the signal is present.
This is surely true since anything below the threshold is just that. If ordering is
preserved below the threshold, the word has no meaning. The difference in terminology
between criterion and threshold is important, for to say the subject adopts a criterion
is to simply say an arbitrary cut point on a continuum is used as the decision rule.

Given that the subject guesses about events which are "subthreshold," he may,
if blanks are ever employed, report the signal is present when it is not (false positive
response). Two techniques, both consistent with the threshold assumption, might be
employed if this occurs. One procedure widely used is to instruct the subject to be more
careful; this can be interpreted as an attempt to instruct the subject to respond
negatively to all "subthreshold" events. The implication of this procedure will be
discussed in a later section. Another procedure, equally valid from the assumptions of this
model, would be to employ a correction for guessing. This correction procedure
assumes the guessing mechanism and the sensory mechanisms are independent. The
excellent experiments of Smith and Wilson¹⁷ were the first, I believe, to show the
inadequacy of this second procedure. This fact led them to reconsider the entire notion

¹⁷ M. Smith and E. A. Wilson, Psychol. Monogr., 1953, 67, Whole No. 359.

of the threshold and they presented, as an alternative model, one very similar to that
suggested by decision-theory analysis. (See especially Sec. IV, footnote 17.) Munson
and Karlin,¹⁸ using an information-theory analysis, investigated the detection process
under "absolute threshold conditions." In order to deal with false positive responses,
they proposed a "discriminant level model." This model is also very similar to that
suggested by decision-theory analysis.

The threshold model could still attempt to account for the data shown in Fig. 2.
The argument would run as follows: Suppose the subject achieves some hit and
false-alarm rate. If the situation is changed in some way, he can modify his behavior by
simply giving more "yes" responses. Since this guessing rate is independent of the
stimulus conditions (both noise and signal-plus-noise events are below the threshold)
this will increase, by the same relative amounts, both the hit and false-alarm rates.
In short, a linear function will result. In the extreme, the subject says "yes" all the time,
hence this linear function must go through the point in the upper right-hand corner
[$P_N(A) = 1.00$, $P_{SN}(A) = 1.00$]. Thus the threshold prediction for the data is a collection
of lines having the upper right-hand corner as the common intercept, and a slope
depending upon the detectability of the signal. No linear function which has this
intercept as one value can fit more than a few of the data points for any value of the
slope. The results of this first experiment, then, seriously conflict with this version of
the threshold model and give some measure of support to the decision-theory analysis.
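
The linear prediction can be made concrete with a small sketch. It assumes one common
form of the threshold model, in which the signal exceeds the threshold with some
probability and the subject says "yes" to subthreshold events at a guessing rate that
equals the false-alarm rate. The parameter `p_detect` and the function name are
hypothetical labels for this illustration.

```python
def threshold_model_hit_rate(false_alarm_rate, p_detect):
    """Hit rate predicted when guessed "yes" responses are added to
    true detections: P_SN(A) = p + (1 - p) * P_N(A)."""
    return p_detect + (1 - p_detect) * false_alarm_rate


for p_detect in (0.2, 0.5):
    line = [threshold_model_hit_rate(fa, p_detect)
            for fa in (0.0, 0.25, 0.5, 0.75, 1.0)]
    print(p_detect, [round(h, 3) for h in line])
```

Every such line ends at the upper right-hand corner, and its slope, one minus the
detection probability, depends only on the detectability of the signal.
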

The  conflict  between  some  version  of  the  threshold  model  and  the  decision 
analysis  has  been  the  subject  of  considerable  experimental  effort.  There  are  other 
experimental  results  more  damaging  to  the  threshold  position.  These  experiments 
attack  the  threshold  concept  directly  because  they  suggest  that  ordering  below  the 
threshold  value  is  indeed  possible. ^^  We  shall  drop  this  conflict  and  proceed  to  other 
questions. 

Actual criterion and optimum criterion. Let us now return to the results
displayed in Fig. 2 and discuss the question of the optimum criterion. It turns out that if
one wishes to select an optimum criterion on likelihood ratio, it is equal to $\beta =
P(N)/P(SN)$, where $\beta$ is the criterion value on likelihood ratio and $P(N)$ and $P(SN)$ are
the a priori probabilities of noise alone and signal-plus-noise, respectively. We can,
of course, obtain a rough measure of the subject's criterion by measuring the slope of
the ROC curve at the point nearest the experimental data point. This rough comparison
is displayed in Fig. 3. Note that while there is a strong relation between the
estimated and optimal criterion values, there is also a consistent departure from an
exact correspondence. The general trend might be summarized by saying the subjects
are conservative; they tend to adopt criteria which are not as different from $\beta = 1$
as they should be. This result is almost an inevitable consequence of the procedure.
The way in which expected values change for various criterion levels is the crux of the
problem. This topic is discussed in more detail in Appendix A.
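
The optimum criterion for each of the five a priori probabilities, and the operating
point it implies, can be sketched as follows. The sketch assumes normal distributions
with equal variance, noise alone as $N(0, 1)$ and signal-plus-noise as $N(d', 1)$, so that
$l(x) = \exp(d'x - d'^2/2)$; it uses $d' = 0.92$ from the fitted curve. The function
names are introduced only for this illustration.

```python
from math import log
from statistics import NormalDist

STANDARD = NormalDist()


def optimum_beta(p_signal):
    """Optimum likelihood-ratio criterion, beta = P(N) / P(SN)."""
    return (1 - p_signal) / p_signal


def operating_point(beta, d_prime):
    """(P_N(A), P_SN(A)) when "yes" is given whenever l(x) >= beta.

    With l(x) = exp(d' * x - d'**2 / 2), the condition l(x) >= beta is
    equivalent to x >= log(beta) / d' + d' / 2.
    """
    cut_point = log(beta) / d_prime + d_prime / 2
    return 1 - STANDARD.cdf(cut_point), 1 - STANDARD.cdf(cut_point - d_prime)


for p_signal in (0.1, 0.3, 0.5, 0.7, 0.9):
    beta = optimum_beta(p_signal)
    p_n, p_sn = operating_point(beta, 0.92)
    print(f"P(SN) = {p_signal:.1f}  beta = {beta:6.3f}  "
          f"P_N(A) = {p_n:.3f}  P_SN(A) = {p_sn:.3f}")
```

The optimum criterion ranges from 9 to about 0.11 over the five conditions, so the
optimum operating points are spread widely along the curve; a conservative subject
would produce points clustered closer to the one for $\beta = 1$.
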

Since these earlier investigations, other procedures have been utilized to vary the
subject's criterion. One which seems more straightforward and is certainly successful
is simply to instruct the subject verbally to adopt different criteria such as lax or very
strict, or even to instruct the subject to maintain a certain value for $P_N(A)$.¹⁹

¹⁸ W. A. Munson and J. E. Karlin, J. Acoust. Soc. Am., 1956, 26, 542.

¹⁹ J. P. Egan, A. I. Schulman, and G. Z. Greenberg, J. Acoust. Soc. Am., 1959, 31, 768.

Figure 3
Comparison of the optimum and obtained criterion levels. This criterion level, $\beta$, is the
equivalent of the criterion level on likelihood ratio. The optimum criterion is obtained by
assuming normal statistics for both hypotheses. It is equal to $[1 - P(SN)]/P(SN)$, where
$P(SN)$ is the a priori probability of the signal.

Measure of detectability. Let us turn now from the question of the criterion
and its adjustment to another aspect of detection-theory analysis, the measure of
detectability, and more specifically, whether or not this measure remains relatively
invariant over different experimental procedures. How one can compare different
measurements obtained using different experimental procedures is an important
question, not only for psychoacousticians but for any scientific enterprise. Let us review
the evidence on the extent to which detection-theory analysis has permitted such a
comparison. If we make the usual assumption that the distribution of likelihood is
normal with equal variance on both hypotheses, as in the situation outlined in the first
experiment, then the measure of detectability is $d'$.

A paper by Swets²⁰ has considered the applicability of this detectability index
for yes-no and forced-choice procedures; he has also compared predicted and obtained
results using two, three, four, six, and eight alternatives in the forced-choice procedure.
In general, these predictions based on $d'$ hold up remarkably well. The worst failure
reported seems to be about 1 db; no consistent error trend is evident in the data.

Another method of generating ROC curves, first suggested by Swets et al.,²¹ has
been employed. Egan et al.¹⁹ tested and compared this method with the standard
yes-no procedure. In the single observation or yes-no procedure, the decision-theory
analysis claims that the subject adopts a single criterion and this determines a "yes"
or "no" response. The experimenter, then, is employing the subject as a threshold
device. Alternatively, the experimenter could have the subject report a number after
each observation such as likelihood ratio; from these numbers, the experimenter could
construct an ROC curve by placing various criteria on the likelihood ratios reported.

The  rating  procedure  is  a  compromise  between  these  two  extremes.  The  subject 
in  the  rating  procedure  is  asked  to  place  each  observation  in  one  of  several  categories; 

²⁰ J. A. Swets, J. Acoust. Soc. Am., 1959, 31, 511.

²¹ J. A. Swets, W. P. Tanner, and T. G. Birdsall, Electronic Defense Group, University
of Michigan, Technical Report No. 40, 1955.

the top one being used for sureness of a signal's presence, the next for a lesser degree of
sureness, and so forth. ROC curves are subsequently constructed. One can then
compare the measure of signal detectability obtained from these two procedures, yes-no
and rating. Egan et al.¹⁹ found that these two measures differed for their three subjects by
0.3, 0.4, and 0.1 db, differences probably well within the experimental error.

In  summary  then,  we  have  seen  how  decision  analysis  allows  one  to  predict 
within  a  fairly  wide  range  of  psychoacoustic  procedures.  The  forced-choice  procedures 
using  two  to  eight  alternatives  and  a  single-interval  procedure  using  two  to  four 
categories  of  response  can  be  summarized  by  a  single  measure  of  detectability,  a 
measure  which,  for  practical  purposes,  is  invariant. 

Implications  for  psychoacoustic  methods.  The  more  traditional  methods  of 
psychoacoustics  utilize  some  parameter  of  the  signal  such  as  the  threshold  energy. 
This  value  is  obtained  by  an  analysis  of  the  subject's  responses.  Many  of  these  methods 
do  not  allow  one  to  determine  directly  the  subject's  criterion  and  in  most  methods  it 
is  presumed  to  be  constant. 

Let us investigate how variation in the subject's criterion, if it occurs, will affect
the estimate of the threshold energy. Variation of the subject's criterion affects the
false-alarm rate $P_N(A)$. Figure 4 shows how the probability distribution for signal-plus-noise
must be varied as the false-alarm rate $P_N(A)$ is changed to maintain a constant
value of signal detection $P_{SN}(A)$. We have assumed Gaussian distributions and
equal variance to construct the solid line of the figure. The insert displays the essentials
of the calculations and shows how a change in $P_N(A)$ from 0.10 to 0.01 necessitates
a change in the mean of the signal distribution from 1.3 to 3.1 in order to maintain
$P_{SN}(A) = 0.50$. This value of $P_{SN}(A)$ is a reasonable one since it is often used as the
Figure 4
Evaluation of how a change in criterion will influence the size of the "threshold" signal.
$P_N(A)$ is the false-alarm rate, a "yes" response to no signal. The hit rate, $P_{SN}(A)$, was held
constant at 0.5. The mean of the signal distribution was varied (see insert) to achieve this hit
rate for various values of $P_N(A)$. The constant, C, was chosen so that 10 log 1.3 + C = 0.
estimate of "threshold." Very small values of false-alarm rate were used because most
methods  control  this  parameter  to  the  extent  of  keeping  it  very  low. 
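
The calculation behind this kind of figure can be sketched as follows, assuming unit-variance Gaussian distributions with the noise mean at zero. The function name `required_signal_mean` is an illustrative choice, not notation from the paper.

```python
from statistics import NormalDist


def required_signal_mean(false_alarm_rate, hit_rate=0.5):
    """Mean of the signal distribution needed to hold the hit rate fixed.

    Assumes noise ~ N(0, 1) and signal-plus-noise ~ N(mean, 1).
    The criterion is set by the false-alarm rate; the signal mean is
    then placed so that the chosen hit rate falls above that criterion.
    """
    std_normal = NormalDist()
    criterion = std_normal.inv_cdf(1.0 - false_alarm_rate)
    return criterion - std_normal.inv_cdf(1.0 - hit_rate)


for rate in (0.10, 0.01, 0.001):
    print(f"P_N(A) = {rate}: mean = {required_signal_mean(rate):.2f}")
```

In this sketch a false-alarm rate of 0.10 gives a mean of about 1.28, a rate of 0.01 gives about 2.33, and a rate of 0.001 gives about 3.09.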

We cannot say generally how this change in the mean of the signal distribution
is related to any signal parameter. However, for sinusoidal signals in noise, d' is
roughly proportional to signal energy; thus the "estimated threshold" may vary over a
6-db range depending on the criterion of the subject. (In other experiments d'
varies with signal voltage; hence the range might be 12 db. See Fig. 7 and the
discussion.)
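
The doubling from 6 to 12 db follows from how a ratio of required d' values is converted to decibels of signal energy. The sketch below makes that step explicit under the same unit-variance Gaussian assumption, with the hit rate held at 0.5; the function name and the `d_prime_tracks` argument are hypothetical.

```python
import math
from statistics import NormalDist


def threshold_shift_db(fa_high, fa_low, d_prime_tracks="energy"):
    """Shift in estimated threshold (db) when the false-alarm rate drops.

    With the hit rate at 0.5, the required d' equals the normal deviate
    of the criterion. If d' is proportional to signal energy, the energy
    ratio equals the d' ratio (10 log10). If d' is proportional to signal
    voltage, energy goes as d' squared (20 log10).
    """
    z = NormalDist().inv_cdf
    ratio = z(1.0 - fa_low) / z(1.0 - fa_high)
    if d_prime_tracks == "energy":
        factor = 10.0
    elif d_prime_tracks == "voltage":
        factor = 20.0
    else:
        raise ValueError("d_prime_tracks must be 'energy' or 'voltage'")
    return factor * math.log10(ratio)


print(round(threshold_shift_db(0.10, 0.001, "energy"), 2))
print(round(threshold_shift_db(0.10, 0.001, "voltage"), 2))
```

Whatever pair of false-alarm rates is chosen, the voltage case gives exactly twice the decibel shift of the energy case.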

This change in the estimated threshold, of say 6 db, will only occur if the subject's
criterion changes. One may be willing to assume that it is approximately constant
over the course of the experiment.^22 Then this number, 6 db, could be interpreted
as a tolerable difference in comparing two sets of different measurements. The theory,
then, is consistent with the rather widespread view in psychoacoustics; namely, that
results obtained using different methods should not be expected to show exact congruence.
Whether these differences are large enough to warrant concern depends both
on the particular nature of the problem and the precision desired.

Decision  analysis  and  speech  research.  The  use  of  ROC  curves  and  the  measure 
d'  has  not  been  limited  to  detection  experiments.  Since  some  confusion  has  been 
generated  by  the  multiplicity  of  d'  measures,  this  issue  deserves  some  attention. 

Figure 5 displays an ROC curve taken from a report by Egan.^23 The similarity
between  this  figure  and  Fig.  2  is  apparent,  even  though  measures  employed  to  construct 
this  graph  differ  greatly.  The  procedure  here  is  as  follows:  A  word  is  presented  in 
noise  to  a  listener  who  writes  down  the  word  he  thinks  was  presented.  He  then  checks 
whether  or  not  he  believes  this  identification  response  is  correct.  The  conditional 
probabilities  of  the  receiver  saying  he  was  correct  on  those  words  where  he  in  fact  was, 
and  was  not  correct,  define  the  ordinate  and  abscissa  respectively  of  Fig.  5. 
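
A minimal sketch of how one point on such a curve could be computed from a listener's trial records is given below. The record layout and the counts are hypothetical; only the two conditional probabilities follow the definition in the text.

```python
def acceptance_rates(trials):
    """Return (abscissa, ordinate) for one response-response ROC point.

    `trials` holds (identified_correctly, accepted) pairs of booleans.
    The abscissa is P(acceptance | incorrect identification) and the
    ordinate is P(acceptance | correct identification).
    """
    trials = list(trials)
    n_correct = sum(1 for ok, _ in trials if ok)
    n_incorrect = len(trials) - n_correct
    accepted_correct = sum(1 for ok, acc in trials if ok and acc)
    accepted_incorrect = sum(1 for ok, acc in trials if not ok and acc)
    return accepted_incorrect / n_incorrect, accepted_correct / n_correct


# Hypothetical record of 100 word presentations (assumed data).
trials = (
    [(True, True)] * 30
    + [(True, False)] * 10
    + [(False, True)] * 12
    + [(False, False)] * 48
)
abscissa, ordinate = acceptance_rates(trials)
print(abscissa, ordinate)
```

Note that both probabilities are conditioned on the listener's own first response, not on the stimulus.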

Egan's  ROC  curve,  then,  is  constructed  from  a  table  of  response-response 
contingencies  rather  than  from  stimulus-response  contingencies,  as  was  the  ROC  curve 
presented  earlier.  This  difference,  from  the  standpoint  of  analysis,  is  by  no  means 
trivial.  The  method  used  by  Egan  is  really  a  two-stage  decision  process.  First,  the 
observer  has  to  select  (from  several  possibilities)  the  most  likely  word;  second,  he  must 
evaluate  this  decision  with  respect  to  all  other  possibilities.  Such  a  process  produces 
mathematical expressions virtually impossible to evaluate except under the most doubtful
set of simplifying assumptions.

This  difficulty  does  not,  of  course,  prevent  one  from  summarizing  the  data 
presented  in  Fig.  5  by  a  single  parameter.  The  line  drawn  to  the  data  points  is  that 
generated  by  moving  a  criterion  along  two  normal  deviates  of  the  same  variance  which 
differ  only  in  means.  This  measure  was,  unfortunately,  initially  labeled  d'  because  of  its 
analogy  to  the  detection  measure.  It  is  unfortunate  because  the  detection  measure  d' 
has  often  been  specifically  related  to  physical  measurements  of  signal  and  noise.  No 

22 Obviously one can only assume it is constant because one cannot directly measure
probabilities of the order 10"^.  If one is not willing to make this assumption, one must raise the
false-alarm rate to a measurable value, $P_N(A)$ > 10 \  or use one of the other techniques
discussed  in  the  previous  section.  The  signal  energy  necessary  to  obtain  a  certain  d',  say 
d'  =  1,  could  then  be  used  as  the  counterpart  of  the  threshold  energy. 

23 J. P. Egan, Hearing and Communication Laboratory, Indiana University, Technical
Report  under  contract,  1957. 


DAVID   M.   GREEN 


[Figure 5 plot: ROC curve, S/N = -12 db. Abscissa: P(acceptance/incorrect identification), 0.00 to 1.00. Ordinate scale: 0.00 to 1.00. Separate symbols mark subjects TC, FG, RG, GJ, NK, and KS.]


Figure  5 
Some data taken from footnote 23. The signal-to-noise ratio refers to the peak signal power
of the word compared with the noise power. The points represent different subjects. The
subject listens to a word in noise, guesses what word it was, and then grades that response as
either being correct (acceptance) or incorrect (rejection). The ordinate and abscissa refer to
the probability of acceptance given that the word was correctly or incorrectly identified, respectively.

such identification was ever intended in speech work, and therefore these measures
obtained in speech research are presently denoted by various subscripts.^24
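
The fitted line described above, generated by moving a criterion along two normal distributions of equal variance that differ only in their means, can be sketched as follows. The separation of 1.0 and the grid of criteria are arbitrary illustrative values.

```python
from statistics import NormalDist


def equal_variance_roc(separation, criteria):
    """ROC points for two unit-variance normal distributions.

    The lower distribution has mean 0 and the upper has mean
    `separation`. Each criterion c gives the probability that a draw
    from each distribution exceeds c.
    """
    std_normal = NormalDist()
    points = []
    for c in criteria:
        lower_rate = 1.0 - std_normal.cdf(c)
        upper_rate = 1.0 - std_normal.cdf(c - separation)
        points.append((lower_rate, upper_rate))
    return points


criteria = [-1.0 + 0.5 * i for i in range(9)]  # -1.0 to 3.0
curve = equal_variance_roc(1.0, criteria)
for x, y in curve:
    print(f"{x:.3f}  {y:.3f}")
```

The single parameter `separation` summarizes the whole curve, which is why it is tempting, and potentially misleading, to give it the same name as the detection measure.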

The importance and usefulness of such measures are reviewed thoroughly in the
monograph by Egan^23 and in the work of Pollack.^25-27 Basically, these measures are
all aimed at specifying the subject's criterion. For an interesting example of how this
value of the criterion affects the substantive conclusion one might draw, the paper by
Pollack^28 is recommended. A recent paper by Clarke^29 has illustrated how confidence
ratings may be utilized to supplement the usual articulation index.


24 As yet, no standard notation has evolved. The following list of references contains
many of the proposals that have been advanced to clarify this confusion. At present, one must
very carefully determine how the detectability measure is defined in each experiment. Even
subscripted measures, d,' in particular, are defined differently in different experiments. See
F. R. Clarke, T. G. Birdsall, and W. P. Tanner, J. Acoust. Soc. Am., 1959, 31, 629; J. P. Egan,
G. Z. Greenberg, and A. I. Schulman, Hearing and Communication Laboratory, Indiana
University, Technical Report under contract, 1959; and I. Pollack, J. Acoust. Soc. Am.,
1959, 31, 1031.

25 I. Pollack, J. Acoust. Soc. Am., 1959, 31, 1500.

26 L. R. Decker and I. Pollack, J. Acoust. Soc. Am., 1959, 31, 1327.

27 I. Pollack and L. R. Decker, J. Acoust. Soc. Am., 1958, 30, 286.

28 I. Pollack, J. Acoust. Soc. Am., 1959, 31, 1509.

29 F. R. Clarke, J. Acoust. Soc. Am., 1960, 32, 35.
Theory of Ideal Observers

In  the  most  general  sense,  an  ideal  observer  is  simply  a  function  relating  an 
observation  to  the  likelihood  of  that  observation.  Thus  we  have  already  specified  an 
ideal  observer  for  our  simple  example,  since  Table  I  accomplishes  this  task.  This  is  not 
an  interesting  example,  however,  because  the  observations  were  already  specified  in  terms 
of  the  probabilities  under  each  hypothesis.  A  more  interesting  example  of  an  ideal 
observer  arises  where  the  observations  are  waveforms  and  where  the  characteristics  of 
the  waveform  differ  under  each  hypothesis.  The  task  of  the  ideal  observer  is,  then,  given 
a  waveform,  calculate  likelihood  ratio  or  some  monotonic  transformation  of  that 
quantity. 

The  ideal  observer,  strictly  speaking,  need  not  make  any  decisions.  If  likelihood 
ratio  is  computed,  the  problem  of  what  decision  rule  to  employ  is  determined  by  the 
specific objective in making the decisions. Various possible objectives have been discussed
in the previous sections, where it was pointed out that these objectives could be
attained by using a decision rule based on likelihood ratio. Although the calculation of
likelihood ratio specifies the ideal observer for a given problem, such information is of
little value unless we can evaluate this observer's performance. One general method of
evaluating the ideal observer's performance is to determine ROC curves, but to obtain
an  ROC  curve  we  must  calculate  two  probabilities.  Thus  to  evaluate  completely  the 
ideal  observer  we  actually  have  to  specify  not  only  how  likelihood  is  calculated  but  the 
probability  distribution  of  likelihood  ratio  on  both  hypotheses. 

Having  established  the  general  background  of  this  problem,  let  us  consider  a 
specific  example:  the  ideal  observer  for  conditions  of  a  signal  which  is  known  exactly. 

Ideal  observer  for  the  signal  known  exactly  (SKE) 

Two  hypotheses  actually  define  this  special  case  in  which,  given  a  waveform, 
one  must  select  one  of  the  following  hypotheses: 

$H_1$: the waveform is a sample of white Gaussian noise n(t) with specified
bandwidth (W) and noise power density ($N_0$).

$H_2$: the waveform is n(t) plus some specified signal waveform s(t). Everything
is known about s(t) if it occurs: its starting time, duration, and phase. It
need not be a segment of a sine wave as long as it is specified, i.e., known
exactly.

From  these  two  hypotheses  we  wish  to  calculate  likelihood  ratio,  and,  if  possible,  derive 
the  probability  distribution  of  likelihood  ratio  on  both  hypotheses.  Obviously  such 
calculations  will  be  of  little  use  unless  the  final  results  can  be  fairly  simply  summarized 
in  terms  of  some  simple  physical  measurement  of  signal  and  noise.  Happily,  such  is  the 
case. 

We shall not present the derivation here since it is not in itself particularly
instructive and can be obtained elsewhere.^7 One assumption of the derivation will,
however, be discussed, since an objection to this assumption has been recently raised;
an objection which seriously questions the legitimacy of applying this result to any
psychoacoustic experiment which has yet been conducted. Unfortunately, the alternative
assumption suggested has a different but equally serious flaw.
Representation of the waveform

The assumption concerns the representation of the waveform. In order to compute
likelihood ratio, one must find the probability of a certain waveform on each
hypothesis. Since the waveform is simply a function of time, one must somehow associate
a probability with this waveform, or somehow obtain a set of measures from the waveform
and associate a probability with these measures.

But  what  exactly  is  the  nature  of  the  waveform?  In  order  to  compute  these 
various  probabilities  we  must  make  some  very  specific  assumptions  about  the  class 
of  waveforms  we  will  consider. 

Peterson, Birdsall, and Fox^7 assumed that the waveforms were Fourier series-band
limited. If the waveform is of this class, it can be represented by $n = 2WT$
measures, where $W$ is the "bandwidth" of the noise and $T$ is the duration of the waveform.
A series representation in terms of sine and cosine might be used. There are,
of course, many equivalent ways of writing this series to identify the $n$ parameters, but
these are all unique, and if the original waveform is indeed Fourier series-band limited,
they will reproduce exactly the waveform in the interval (0, T). Accepting this assumption,
we find that a monotonic transformation of likelihood ratio (the logarithm) is
normal under both hypotheses.

$H_1$: $\log l(x)$ is normal with mean $-E/N_0$, variance $2E/N_0$,
$H_2$: $\log l(x)$ is normal with mean $+E/N_0$, variance $2E/N_0$,

$d' = \Delta M/\sigma = (2E/N_0)^{1/2}$, where $E$ is the signal energy, $\int_0^T [s(t)]^2\,dt$, and $N_0$ is the noise
power density. Naturally, if this assumption about the waveform is not made the
preceding result is invalid. Mathews and David^30 have considered a slightly different
assumption. They assumed the waveforms are Fourier integral-band limited. The
conclusion resulting from this assumption is that the signal is perfectly detectable in the
noise independent of the ratio $E/N_0$, as long as it is not zero. In short, d' is infinite
for any nonzero value of $E/N_0$. Which of these assumptions is the more reasonable or
applicable to a psychoacoustic experiment?
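
The finite-measure result can be checked numerically. The sketch below assumes the waveform is represented by a finite list of samples with independent Gaussian noise of variance `sigma2` on each sample; in that representation the quantity $2E/N_0$ corresponds to the sum of squared signal samples divided by `sigma2`. The sample values, the trial count, and all function names are illustrative assumptions.

```python
import math
import random
import statistics


def log_likelihood_ratio(x, s, sigma2):
    """log l(x) for a known signal s in independent Gaussian noise."""
    cross = sum(xi * si for xi, si in zip(x, s))
    energy_term = sum(si * si for si in s)
    return (cross - 0.5 * energy_term) / sigma2


def simulate(s, sigma2, signal_present, trials, rng):
    """Draw waveforms under one hypothesis and return log l(x) for each."""
    sd = math.sqrt(sigma2)
    values = []
    for _ in range(trials):
        x = [rng.gauss(0.0, sd) + (si if signal_present else 0.0) for si in s]
        values.append(log_likelihood_ratio(x, s, sigma2))
    return values


rng = random.Random(0)
s = [0.5 * math.sin(2.0 * math.pi * k / 8.0) for k in range(32)]
sigma2 = 1.0
e_over_n0 = sum(v * v for v in s) / (2.0 * sigma2)  # plays the role of E/N_0

noise_only = simulate(s, sigma2, False, 4000, rng)
signal_plus_noise = simulate(s, sigma2, True, 4000, rng)

d_prime_predicted = math.sqrt(2.0 * e_over_n0)
d_prime_observed = (
    statistics.mean(signal_plus_noise) - statistics.mean(noise_only)
) / statistics.stdev(noise_only)
print(d_prime_predicted, d_prime_observed)
```

The sample means should fall near $-E/N_0$ and $+E/N_0$, and the sample variances near $2E/N_0$, so the observed separation in standard-deviation units approximates $(2E/N_0)^{1/2}$.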

Neither  assumption  can  be  completely  justified.  In  almost  all  psychoacoustic 
experiments,  the  noise  voltage  is  actually  produced  by  a  special  tube.  The  voltage 
produced  by  this  tube  is  amplified  and  filtered.  Such  noise  is  not  Fourier  series-band 
limited, for the noise is clearly not periodic.^31 Although a Fourier series might serve

30 M. V. Mathews and E. E. David, J. Acoust. Soc. Am., 1959, 31, 834(A).

31 It is somewhat unfair to imply that Peterson, Birdsall, and Fox assumed the noise was
periodic. Their assumption, strictly speaking, was that each waveform could be represented by
a finite set of numbers. The way they obtained these numbers is through a sampling plan, which
we cannot discuss in detail. It was not a simple Fourier expansion in terms of sine and cosine.
This  is  a  difficult  and  complex  topic;  for  a  discussion  of  the  details  in  this  area  see  footnote  7; 
D.  Slepian,  "Some  comments  on  the  detection  of  Gaussian  signals  in  Gaussian  noise,"  Trans. 
IRE,  PGIT-4,  65  (1958);  and  W.  B.  Davenport  and  W.  L.  Root,  Random  signals  in  noise.  New 
York:  McGraw-Hill,  1958.  Precise  analysis  of  the  situation  where  the  noise  is  filtered,  i.e., 
where  the  power  spectrum  of  the  noise  is  a  polynomial,  can  be  worked  out  in  principle.  The 
analysis  is  complex  and  exact  answers  can  be  obtained  only  in  certain  simple  cases.  One  can 
show  in  general,  however,  that  for  practical  situations  the  detectability  of  the  signal  is  finite. 
(See  Davenport  and  Root.) 
as an excellent approximation to these waveforms in the interval (0, T), it would not
be  an  exact  representation  of  the  waveform.  Similarly,  an  assumption  of  a  Fourier 
integral  limitation  of  the  bandwidth  cannot  be  correct,  because  the  waveform  does  not 
have  a  sharp  cutoff  in  the  Fourier  integral  sense.  If  it  did,  the  waveform  would  be 
analytic.  If  it  were  analytic,  the  ideal  observer  could  sample  at  one  point  in  time,  obtain 
all  the  derivatives  at  that  point,  and  know  the  exact  form  of  the  wave  for  all  time.  Such 
a  result  leads  to  the  conclusion  that  the  ideal  observer,  by  observing  one  sample  of  the 
waveform  at  any  time  can,  immediately,  in  principle,  make  his  decision  about  all  the 
waveforms  the  experimenter  has  presented  in  the  past  and  all  those  he  may  ever  decide 
to  produce.  This  approach  is  therefore  of  little  practical  use. 

The  issue,  while  obviously  only  an  academic  one,  has  indicated  one  very 
important  aspect  of  the  problem.  The  ideal  observer  is,  like  all  ideal  concepts,  only  as 
good  as  the  assumptions  that  generate  it.  Clearly,  any  such  idealization  of  a  practical 
situation  is  based  on  certain  simplifying  assumptions.  It  is  always  extremely  important 
to  understand  what  these  assumptions  are  and  even  more  important  to  realize  the 
implications  of  a  change  in  these  assumptions.  In  short,  there  are  many  ideal  observers, 
each  generated  by  certain  key  assumptions  about  the  essential  nature  of  the  detection 
task. 

For the discussion which follows, we shall use the Peterson, Birdsall, and Fox^7
approach and assume that the waveforms can be completely represented by a finite
number of measurements. A similar treatment is given by Van Meter and Middleton.^32
As more progress is made with the theory of ideal observers we should be able to state
quite precisely how detection will vary if certain definite restrictions are imposed on the
manner in which the observer operates. Peterson, Birdsall, and Fox have, in fact, considered
several such cases and their results. Each case provides us with a framework
from which we may evaluate and assess the performance of the subject. Such a comparison
provides both qualitative and quantitative guides for further research.^33
There are several areas we might select to illustrate this approach. The one we have
selected was chosen because it is a general topic and because it has been slighted somewhat
in psychoacoustics.

Shape  of  the  psychophysical  function 

The psychophysical function is generally defined as the curve relating the percentage
of correct detections of the signal (the ordinate) to some physical measure of the
signal (the abscissa). If some variant of the constant stimuli method is used, the curve
rises monotonically from zero to one hundred percent as the signal level is increased.

Generally,  hypotheses  about  the  form  of  this  function  arise  from  assumptions 
about  the  process  of  discrimination.  Often  these  assumptions  are  sufficient  to  allow 
one to deduce the form of the psychophysical function to within two or three parameters
which are then determined experimentally. Obviously, it is extremely important
for  the  model  to  specify  the  exact  transformation  of  the  physical  stimulus  which  is  used 
as  the  abscissa  of  the  psychophysical  function;  without  such  specification,  the  theory 
is  incomplete. 

In  psychoacoustics,  there  has  been  comparatively  little  concern  with  the  form  of 
this  function.  Most  theories  of  the  auditory  process  have  been  content  with  attempting 

32 D. Van Meter and D. Middleton, Trans. IRE, 1954, PGIT-4, 119.

33 W. P. Tanner and T. G. Birdsall, J. Acoust. Soc. Am., 1958, 30, 922.
to predict only one parameter of the psychophysical curve, usually the mean or
threshold.  As  a  result,  it  is  nearly  impossible  to  obtain  from  the  literature  information 
on  the  actual  form  of  the  psychophysical  function. 

The notable exception to the preceding statement is the neural-quantum
hypothesis.^34 The authors of this theory say that it "enables us to predict the form and
the  slope  of  certain  psychometric  functions."  It  can  be  demonstrated  from  the  model 
that  the  form  of  the  function  should  be  linear  and  this  linear  function  is  specified  to 
within  one  parameter.  The  physical  measure  is  never  mentioned  in  the  derivation  of  the 
theory  and  we  find  only  after  the  data  are  presented  that  sound  pressure  and  frequency 
are  the  appropriate  physical  measures.  The  authors  remark  in  their  paper  that  "strictly 
speaking,  data  yielding  rectilinear  psychometric  functions  when  plotted  against  sound 
pressure  do  not  show  absolute  rectilinearity  when  expressed  in  terms  of  sound  energy, 
but  calculation  shows  that  the  departure  from  rectilinearity  is  negligible."  It  is  certainly 
true  that  pressure,  pressure  squared,  and  indeed  pressure  cubed,  are  all  nearly  linear 
for  small  values  of  pressure — but  that  is  not  entirely  the  point. 

It  is  the  location  of  this  function  that  plays  a  crucial  role  in  the  theory.  If  the 
subject employs a two-quantum criterion then, according to the theory, the psychophysical
function must be zero up to one quantum unit, show a linear increase to one
hundred percent at two quantum units, and maintain this level for more quantum units.
Where the curve breaks from zero percent reports and where it reaches one hundred
percent reports is precisely specified by the theory. In general, if the subject requires
n quanta to produce a positive report, the increasing linear function must extend from
n to n + 1 quantum units. Now clearly, what appears to be a two-quantum subject
(0%  at  one  pressure  unit,  100%  at  two  pressure  units),  when  the  data  are  plotted  in 
pressure  units,  cannot  be  interpreted  as  a  two-quantum  subject  in  energy  units.  In  fact, 
he  cannot  be  interpreted  as  an  any-number-of-quantum  subject.  This  is  true  no  matter 
how  small  the  values  of  pressure. 
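
The arithmetic behind this point can be sketched briefly. The code assumes only that energy is proportional to pressure squared and that the theory requires the rising segment to run between two consecutive whole numbers of some quantum unit; the names and the search range of 1 to 999 are illustrative.

```python
def endpoint_ratio(k):
    """Ratio of upper to lower endpoint for a rise from k to k + 1 units."""
    return (k + 1) / k


def energy_endpoint_ratio(low_pressure, high_pressure):
    """Endpoint ratio after replotting pressure endpoints in energy units."""
    return (high_pressure ** 2) / (low_pressure ** 2)


# A subject whose function rises from one to two pressure units.
ratio_large = energy_endpoint_ratio(1.0, 2.0)
# The same subject with all pressures scaled down a thousandfold.
ratio_small = energy_endpoint_ratio(0.001, 0.002)

# Endpoint ratios the theory allows: 2, 3/2, 4/3, ... and never above 2.
allowed = [endpoint_ratio(k) for k in range(1, 1000)]
matches = [k for k in range(1, 1000) if abs(endpoint_ratio(k) - ratio_large) < 1e-9]
print(ratio_large, ratio_small, max(allowed), matches)
```

The endpoints stand in a ratio of 4 to 1 in energy units at any pressure scale, while the largest ratio the theory permits is 2 to 1, so no whole number of quanta fits.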

This  criticism  of  the  rather  post  hoc  treatment  of  the  physical  scale  is  by  no 
means  limited  to  the  neural-quantum  hypothesis.  Many  hypotheses  about  the  shape 
of the psychophysical function, including some formulations of the Gaussian hypothesis,
neglect  this  rather  crucial  factor. 

Detection  theory  stands  in  marked  contrast  with  these  theories.  Models  based 
on  the  ideal  observer  concept  predict  the  form  of  the  psychophysical  function  exactly. 
The proper physical dimensions are completely specified and there are no free parameters.

Obviously,  one  would  not  be  surprised  to  find  human  observers  somewhat  less 
than  optimum,  but  hopefully,  the  shape  of  the  psychophysical  function  might  at  least 
be parallel to that obtained from the model. Often, however, the obtained psychophysical
function does not parallel that predicted by the model and this discrepancy
deserves  some  discussion. 

Signal uncertainty and ideal detectors.^35 In Fig. 6, we have plotted the percentage
of correct detections in a two-alternative forced-choice procedure versus

34 S. S. Stevens, C. T. Morgan, and J. Volkmann, Am. J. Psychol., 1941, 54, 315.

35 The analysis of detection data from the viewpoint of signal uncertainty is very similar
to  some  ideas  expressed  by  Dr.  W.  P.  Tanner.  Although  several  details  of  the  analysis  differ, 
the  essentials  are  the  same.  The  author  is  indebted  to  Dr.  Tanner  for  many  long  and  lively 
conversations  on  this  topic. 
Figure 6
The theoretical psychophysical functions for the ideal observer detecting 1 of M orthogonal
signals. The parameter M is the number of possible orthogonal signals. The ideal detector
need only detect the signal, not identify it. The abscissa is ten times the logarithm of signal
energy to noise-power density. The ordinate is the percent correct detection in a two-alternative
forced-choice test. The obtained data are compared with the theoretical function shifted about
10 db to the right.

$E/N_0$ for a typical subject and a series of mathematical models. The problem in all cases
is  simply  to  detect  a  sinusoidal  signal  added  to  a  background  of  white  noise. 

We say "typical subject" because the shape of this function is remarkably invariant
over both subjects and a range of physical parameters. For signal durations
of 10 to 1000 msec^36 and signal frequencies from 250 to 4000 cps,^37 there appears to be
no great change in the shape of the function when plotted against the scale shown in
Fig. 6. Naturally, the exact location of the curve depends on the exact physical parameters
of the signal, but except for this constant, which is a simple additive constant in
logarithmic form, the shape is remarkably stable. The striking aspect of this function
is its slope. We notice the slope of the observed function is steeper than most of the
theoretical functions depicted in Fig. 6.

The  class  of  theoretical  functions  is  generated  by  assuming  the  detector  has 
various uncertainties about the exact nature of the signal.^38 Each function is generated
by  assuming  the  detector  knows  only  that  the  signal  will  be  one  of  M-orthogonal  signals. 
If  the  signal  is  known  exactly  (M  =  1)  there  is  no  uncertainty.  For  sinusoidal  signals, 
the  nature  of  the  uncertainty  might  be  phase,  time  of  occurrence  of  the  signal,  or 
signal  frequency.  The  degree  of  uncertainty  is  reflected  by  the  parameter  M.    As  this 

36 D. M. Green, J. Acoust. Soc. Am., 1959, 31, 836(A).

37 D. M. Green, M. J. McKey, and J. C. R. Licklider, J. Acoust. Soc. Am., 1959, 31, 1446.

38 The details of this model may be found in footnote 7, p. 207. This particular model
was selected because it has been presented in the literature. There are other models which
assume signal uncertainty but which differ in details about the decision rule. The psychophysical
functions produced by these models are similar to those displayed in Fig. 6, although
the value of the parameter (M) would be changed somewhat.
uncertainty increases, the psychophysical function increases in slope. It therefore
appears  that  there  may  exist  a  model  with  sufficient  uncertainty  about  the  signal  to 
generate  a  function  which  is  very  similar  to  that  displayed  by  the  human  observer. 
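
One way to explore the effect of M is a small Monte Carlo sketch. It does not reproduce the model in the cited reference: it assumes, for illustration only, a detector that monitors M unit-variance channels in each interval and chooses the interval containing the largest output, with the signal raising the mean of one channel by d'. All names and values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def d_prime_from_db(e_over_n0_db):
    """Convert 10 log10(E/N_0) to d' = (2E/N_0)^(1/2)."""
    return math.sqrt(2.0 * 10.0 ** (e_over_n0_db / 10.0))


def percent_correct_2afc(d_prime, m, trials, rng):
    """Estimate 2AFC proportion correct for a max-of-M-channels rule."""
    correct = 0
    for _ in range(trials):
        # Signal interval: one channel carries the signal, m - 1 do not.
        signal_max = rng.gauss(d_prime, 1.0)
        for _ in range(m - 1):
            signal_max = max(signal_max, rng.gauss(0.0, 1.0))
        # Noise interval: all m channels carry noise alone.
        noise_max = max(rng.gauss(0.0, 1.0) for _ in range(m))
        if signal_max > noise_max:
            correct += 1
    return correct / trials


rng = random.Random(1)
pc_m1 = percent_correct_2afc(2.0, 1, 4000, rng)
pc_m64 = percent_correct_2afc(2.0, 64, 4000, rng)
# For M = 1 the rule reduces to the known-signal case.
pc_m1_exact = NormalDist().cdf(2.0 / math.sqrt(2.0))
print(pc_m1, pc_m1_exact, pc_m64)
```

Running the estimate over a grid of values from `d_prime_from_db` for several M gives a family of functions whose slopes can be compared with one another.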

Accepting  for  the  moment  the  assumption  that  the  extreme  slope  of  the  human 
observer's  psychophysical  function  is  due  to  some  degree  of  uncertainty  about  the 
signal,  we  might  try  to  manipulate  this  slope  by  various  experimental  procedures. 

Preview  technique.  One  general  class  of  procedures  would  attempt  to  reduce 
the  uncertainty  by  supplying  the  missing  information  through  some  form  of  cueing  or 
preview  technique.  If,  for  example,  the  observer  is  uncertain  about  the  frequency  of  the 
signal  we  might  attempt  to  reduce  this  uncertainty  by  presenting  the  signal  briefly  at  a 
high  level  just  prior  to  the  observation  interval.  Similarly,  if  the  time  of  occurrence  of 
the  signal  is  uncertain  we  might  increase  the  noise  during  the  observation  interval.  If 
the  noise  was  increased  for  all  trials,  whether  or  not  the  signal  was  presented,  it  would 
provide no information about the signal's presence but would convey direct information
about the signal's starting time and duration. Both of these techniques have been
utilized with only partial success. While it is impossible to assert that there was no
change (the null hypothesis), the amount of change was very small, although in the
proper direction.^39

Another  class  of  procedures  which  has  been  utilized  to  attempt  to  reduce  the 
subject's  uncertainty  about  the  signal  parameters  involves  changing  the  detection  task 
so  that  some  information  is  directly  supplied.  The  procedures  are  like  the  preceding 
but  actually  include  the  information  in  the  observation  interval.  For  example,  to 
remove  frequency  uncertainty,  we  might  add  a  continuous  sine  wave  to  the  noise.  The 
continuous  sine  wave  is  adjusted  to  a  level  such  that  it  is  clearly  evident  in  the  noise. 
The signal is an increment added to this sine wave and the task is to detect this increment.
The procedure definitely changes the slope of the subject's psychophysical
function: it becomes less steep and the signal is easier to detect.^40

This  procedure  of  making  the  signal  an  increment  to  a  continuous  sine  wave 
provides  good  frequency  information  but  does  not  remove  temporal  uncertainty. 
Another  procedure  which  minimizes  practically  all  uncertainty  is  in  fact  a  modification 
of  a  standard  procedure  used  to  investigate  the  j'.n.d.  for  intensity.  A  two-alternative 
forced-choice  procedure  is  employed.  Two  gated  sinusoids  occur  in  noise,  one  at 
standard  level,  the  other  at  this  level  plus  an  increment.  The  subject's  task  is  to  select 
the  interval  containing  the  increment.  If  the  standard  signal  is  adjusted  to  a  power 
level  about  equal  to  the  noise-power  density,  the  psychophysical  function  actually 
parallels that expected for the signal-known-exactly case.⁴¹ It is from 3 to 6 db off
optimum  in  absolute  value,  depending  on  the  energy  of  the  standard.  (See  Fig.  7. 
Note  the  change  in  scale  between  Figs.  6  and  7.) 

Let  us,  at  least  tentatively,  accept  as  the  conclusion  of  these  last  results  that  the 
shape  of  the  psychophysical  function  is  in  fact  due  primarily  to  various  uncertainties 
about  the  signal  parameter.  If  this  is  true,  then  we  still  have  the  problem  of  explaining 

39 Unpublished work of the author. Also see T. Marill, Ph.D. thesis, Massachusetts
Institute of Technology, 1956, and J. C. R. Licklider and G. H. Flanagan, "On a methodological problem in audiometry," unpublished.

40 W. P. Tanner, J. Bigelow, and D. M. Green, unpublished.

41 W. P. Tanner, Electronic Defense Group, University of Michigan, Technical Report
No. 47, 1958.
the lack of success evidenced when the previous techniques were employed. Should
not  a  preview  of  the  signal,  preceding  an  observation,  serve  to  reduce  frequency  un- 
certainty? The  answer  might  be  that  such  procedures  do  reduce  uncertainty,  but  not 
enough  relative  to  the  uncertainty  still  remaining.  From  Fig.  6  we  note  that,  as  we 
introduce  signal  uncertainty,  the  slope  of  the  psychophysical  function  increases  very 
rapidly  for  small  changes  in  uncertainty;  then,  as  the  uncertainty  increases,  the  slope 
approaches  some  asymptotic  value.  A  change  in  uncertainty  from  M  =  256  to  64  may 
hardly  affect  the  psychophysical  function.  This  fact  also  probably  explains  why  the 
psychophysical  functions  do  not  appear  to  change  very  much  for  a  variety  of  signal 
parameters,  such  as  signal  duration  and  signal  frequency.  Undoubtedly,  as  the  signal 
duration  increases,  the  uncertainty  about  the  time  of  occurrence  of  the  signal  is  re- 
duced. Due  to  the  large  initial  uncertainty,  this  change  is  too  small  to  be  detected  in 
the  data. 
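The flattening described in this paragraph can be illustrated numerically. The sketch below is not the likelihood-ratio ideal observer of Fig. 6. It assumes a simpler, hypothetical rule: in a two-alternative forced-choice trial each interval yields M independent unit-variance Gaussian channel outputs, the signal raises the mean of one channel by d, and the observer chooses the interval containing the largest output. The names `percent_correct` and `level_for`, the parameter d, and the conversion 20 log10(d) to decibels are all assumptions made for illustration.

```python
from math import log10
from statistics import NormalDist

STD = NormalDist()  # every channel is assumed to be unit-variance Gaussian


def percent_correct(d, m, step=0.02):
    """Proportion correct in 2AFC for a maximum-of-m-channels rule.

    P(C) = 1 - m * integral of Phi(x - d) * phi(x) * Phi(x)**(2m - 2) dx,
    evaluated with the trapezoidal rule on a fixed grid.
    """
    lo, hi = -10.0, 10.0 + d
    n = int(round((hi - lo) / step))
    total = 0.0
    for i in range(n + 1):
        x = lo + i * step
        y = STD.cdf(x - d) * STD.pdf(x) * STD.cdf(x) ** (2 * m - 2)
        total += y * (0.5 if i in (0, n) else 1.0)
    return 1.0 - m * total * step


def level_for(target, m):
    """Bisect for the signal strength d that gives the target proportion correct."""
    lo, hi = 0.0, 10.0
    for _ in range(30):
        mid = 0.5 * (lo + hi)
        if percent_correct(mid, m) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    reference = level_for(0.75, 1)
    for m in (1, 4, 16, 64, 256):
        shift_db = 20.0 * log10(level_for(0.75, m) / reference)
        print(f"M = {m:3d}: level for 75% correct is {shift_db:5.2f} db above M = 1")
```

Under these assumptions the level required for 75 percent correct rises quickly as M leaves 1 and much more slowly between M = 64 and M = 256, which is the qualitative behavior the text attributes to Fig. 6.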

Uncertain  signal  frequency.  Still  another  manner  of  checking  this  general 
model is to vary the uncertainty of the signal and determine how this affects the subject's
performance.  One  might,  for  example,  select  several  different  sinusoidal  signals  and 
select  one  at  random  as  the  signal  used  on  a  particular  trial.  The  subject  is  simply  asked 
to  detect  a  signal,  not  identify  it.  Depending  on  the  frequency  separation  and  the 
number  of  signals  used,  one  can  directly  manipulate  signal  uncertainty. 
Figure 7
Observed data in the $\Delta I$ versus $I$ experiment and the signal-known-exactly observer (M = 1).
The abscissa and ordinate are the same as in Figure 6, but note the change in scale of the
abscissa. The two curves differ by 6 db at each value of percent correct. The apparent convergence of the two curves at low values of percent correct is illusory. The insert shows the
level of the noise; the lines show the level of $I$ in power, and the maximum $I + \Delta I$ power.
Figure 8
The variation of signal-to-noise level for some constant percent correct as a function of M.
This curve is the same information presented in Figure 6 with M as the variable and percent
correct as the parameter.

This in fact was a procedure used in an earlier study by Tanner et al.¹⁶ A small
decrement  (1.0  to  1.5  db)  in  detectability  was  found  if  one  compared  a  situation  where 
a  single  fixed  sinusoid  was  the  signal  and  a  situation  where  the  signal  was  one  of  two 
sinusoids. Later results⁴²⁻⁴⁴ show, however, that the decrement did not increase very
much  as  more  components  were  included  in  the  set  of  possible  signals.  This  result  is 
consistent  with  the  theoretical  model  we  have  been  discussing.  Figure  8  shows  how, 
for  a  constant  detectability,  one  must  change  the  signal  level  as  uncertainty  (M)  is 
increased.  The  decrement  in  signal  detectability  as  a  function  of  signal  uncertainty 
changes very slowly after M reaches a value of 50 or so. The 1.5 db per octave decrement, suggested by some of the earlier models to account for the uncertain frequency
data,⁴⁵ is only a reasonable approximation for a rather limited range of M.⁴⁶

While  the  preceding  argument  that  the  shape  of  the  psychophysical  function  is  largely 
due  to  signal  uncertainty  has  some  appeal,  there  still  remain  some  problems  with  this 
interpretation.  Another  way  to  attack  this  problem  of  signal  uncertainty  is  to  use  a 
signal  where  little  information  about  the  waveform  is  known,  and  compare  the  subject's 
performance  with  the  theoretical  optimum  model  in  this  situation.  A  specific  case 
arises  where  the  signal  is  a  sample  of  noise.  The  most  one  can  specify  about  the  signal 
is  the  frequency  region,  starting  time,  duration,  and  power.  The  ideal  detector  for  this 
signal  can  be  specified — it  simply  measures  signal  energy  in  the  signal  band.    But  the 

42 F. A. Veniar, J. Acoust. Soc. Am., 1958, 30, 1020.

43 F. A. Veniar, J. Acoust. Soc. Am., 1958, 30, 1075.

44 C. D. Creelman, Electronic Defense Group, University of Michigan, Technical
Memo. No. 71, 1959.

45 D. M. Green, J. Acoust. Soc. Am., 1958, 30, 904. See also footnotes 16 and 44.

46 J. P. Egan, G. Z. Greenberg, and A. I. Schulman, J. Acoust. Soc. Am., 1959, 31,
1579(A). Egan et al. have investigated how temporal uncertainty affects signal detectability.
In one condition they present a fixed-frequency sinusoidal signal of 0.25 sec duration somewhere in an 8-sec interval. They did not report the results in detail, but the decrement in
detectability due to temporal uncertainty was small (1 or 2 db).
psychophysical functions obtained with this type of signal are also slightly steeper than
those predicted by the model.⁴⁷ Either partial time uncertainty still remains or signal
uncertainty  alone  is  not  a  sufficient  explanation.  The  author  feels  that  a  better  model 
would  assume  that  the  human  observer  utilizes  some  nonlinear  detection  rule.  This 
assumption,  coupled  with  the  uncertainty  explanation,  could  probably  explain  most  of 
the results obtained thus far. The mathematical analysis of such devices is, however,
complex.

Internal  noise.  Before  summarizing,  one  final  point  must  be  considered. 
Often  it  is  a  temptation  to  invoke  the  concept  of  internal  or  neural  noise  when  dis- 
cussing the  discrepancy  between  an  ideal  model  and  the  human  observer.  There  are 
good  reasons  for  avoiding  this  temptation.  While  it  would  take  us  too  far  afield  to 
cover  this  point  in  detail,  the  following  remarks  will  illustrate  the  point. 

Only  if  the  model  is  of  a  particularly  simple  form  can  one  hope  to  evaluate  the 
specific  effects  of  the  assumption  of  internal  noise.  The  signal-known-exactly  observer 
is  of  this  type.  Here  one  can  show  how  a  specific  type  of  internal  noise  can  simply  be 
treated  as  adding  noise  at  the  input  of  the  detection  device.  Thus  one  can  evaluate  the 
psychophysical  function  and  it  will  be  shifted  to  the  right  by  some  number  of  decibels 
(see  Fig.  6)  due  to  the  internal  noise.  But,  of  course,  such  an  assumption  can  immedi- 
ately be  rejected  since  no  shift  in  the  psychophysical  function  can  account  for  the  data 
displayed  in  the  figure. 

With  more  complicated  models,  it  is  usually  difficult  to  say  exactly  what  internal 
noise  will  do.  While  it  will  obviously  lower  discrimination,  the  specific  effects  of  the 
assumption  are  often  impossible  to  evaluate.  Unless  these  specific  effects  can  be 
evaluated,  the  assumption  simply  rephrases  the  original  problem  of  the  discrepancy. 

I  am  not  suggesting  that  the  human  observer  is  perfect  in  any  sense,  nor  attempt- 
ing to  minimize  the  importance  of  the  concept  of  internal  noise.  What  I  am  emphasiz- 
ing is  that  the  concept  must  be  used  with  great  care.  If  the  concept  is  to  have  any 
importance  it  must  be  made  specific.  This  implies  that  we  have  to  (1)  state  exactly 
what  this  noise  is,  i.e.,  that  we  have  to  characterize  it  mathematically,  (2)  specify  in 
what  way  it  interacts  with  the  detection  or  discrimination  process,  and  (3)  evaluate 
specifically what effect it will have on performance. Unless these steps can be carried
out, the ad hoc nature of the assumption vitiates its usefulness.

Summary  and  Conclusion 

The  main  emphasis  in  this  paper  has  been  to  explain  detection  theory  and  to 
illustrate  how  such  a  theory  has  been  applied  to  certain  areas  of  psychoacoustics. 
This  method  of  analysis  is  simply  one  of  many  that  are  currently  being  used  in  an  attempt 
to  understand  the  process  of  hearing. 

Two  main  aspects  of  this  approach  have  been  distinguished.  The  first,  decision 
theory,  emphasizes  that  the  subject's  criterion  as  well  as  the  physical  properties  of  the 
stimulus  play  a  major  role  in  determining  the  subject's  responses.  The  theory  indicates 
both  the  class  of  variables  which  determines  the  level  of  the  criterion,  and,  more 
importantly,  suggests  an  analytic  technique  for  removing  this  source  of  variation.  This 
technique  leaves  a  relatively  pure  measure  of  the  detectability  of  the  signal.  The 
invariance  of  this  measure  over  several  psychophysical  procedures  has  already  been 
demonstrated. 

47 D. M. Green, J. Acoust. Soc. Am., 1960, 32, 121.
Figure 9
The normalized expected value as a function of changes in criterion. This is a theoretical
curve based on the data presented in Figures 2 and 3. The appendix lists the assumptions used
to construct the curve.

The  second  aspect,  the  theory  of  ideal  observers,  has  also  been  discussed  in  some 
detail.  The  usefulness  of  such  an  analysis  was  illustrated  by  considering  the  form  of  the 
psychophysical  function.  No  ideal  observer  provides  a  complete  or  comprehensive 
model  even  for  the  rather  limited  areas  of  psychoacoustics  that  we  have  discussed  in  this 
paper.  The  model  provides  a  source  of  hypotheses  and  a  standard  against  which  experi- 
mental results  can  be  evaluated.  It  is  too  early  to  attempt  any  complete  evaluation  of 
this  approach.  The  mathematical  models  are  relatively  new  and  the  application  of  these 
models to a sensory process began with Tanner and Swets⁴⁸ only about five years ago.
There remain many problems to be solved, of both a mathematical and an experimental
nature. As more progress is made in both areas, the theory should become more specific
and concrete; then perhaps it will be able to interact more directly with the research
from several other areas in psychoacoustics.

Appendix  A 

The  inherent  difficulty  of  comparing  the  optimum  criterion  value  and  that 
employed  by  the  subject  is  the  shape  of  the  expected-value  function.  Let  us  investigate 
in  detail  a  typical  situation.  We  have  assumed  that  the  distribution  on  likelihood  ratio 
is  normal  under  both  hypotheses,  that  the  mean  separation  is  one  sigma  unit,  and 
that  the  values  and  costs  of  the  various  decision  alternatives  are  all  the  same.  From 
these  assumptions  we  have  constructed  Fig.  9.  This  figure  shows  how  the  expected 
value varies with changes in a priori probability of signal $P(SN)$ and false-alarm rate
$P_N(A)$. We see immediately that for extreme values of a priori probability, e.g., $P(SN) = 0.10$, the difference between optimum expected-value behavior [$P_N(A) = 0.004$] and
a pure strategy [$P_N(A) = 0.000$] is less than 3%. In fact, the curves in the figure were
somewhat exaggerated to allow one to see the location of the maximum. Since most
subjects are instructed to avoid pure strategies in psychoacoustic experiments, this
tends to force the subject to adopt more moderate values of $P_N(A)$ for extreme conditions.

48 W. P. Tanner and J. A. Swets, Psychol. Rev., 1954, 61, 401.

On  the  other  hand,  if  more  moderate  a  priori  probabilities  are  employed  in  the 
experiment [e.g., $P(SN) = 0.50$], we see that any value of $P_N(A)$ within a range from
0.15  to  0.50  will  achieve  at  least  90%  of  the  maximum  expected  payoff. 

Thus  any  attempt  to  investigate,  in  any  more  than  a  correlational  sense,  the 
correspondence  between  obtained  and  optimum  criteria  appears  extremely  difficult. 
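The flatness of the expected-value function near its maximum can be checked with a short calculation. The sketch below follows the stated assumptions (normal distributions under both hypotheses, a mean separation of one sigma unit, equal values and costs), but the specific payoffs of +1 for a correct decision and -1 for an error are an assumption, as are the function names. The normalization used for Fig. 9 is not given here, so the exact percentages may differ from those quoted above.

```python
from math import log
from statistics import NormalDist

STD = NormalDist()


def rates(criterion, d_prime=1.0):
    """Return (false-alarm rate, hit rate) for a cutoff on the decision axis.

    Noise alone is N(0, 1); signal plus noise is N(d_prime, 1).
    """
    false_alarm = 1.0 - STD.cdf(criterion)
    hit = 1.0 - STD.cdf(criterion - d_prime)
    return false_alarm, hit


def expected_value(criterion, p_signal, d_prime=1.0):
    """Expected payoff per trial with +1 for correct and -1 for wrong (assumed)."""
    false_alarm, hit = rates(criterion, d_prime)
    return p_signal * (2.0 * hit - 1.0) + (1.0 - p_signal) * (1.0 - 2.0 * false_alarm)


def optimal_criterion(p_signal, d_prime=1.0):
    """Cutoff at which the likelihood ratio equals P(N)/P(SN)."""
    beta = (1.0 - p_signal) / p_signal
    return log(beta) / d_prime + d_prime / 2.0


if __name__ == "__main__":
    for p_signal in (0.10, 0.50):
        cutoff = optimal_criterion(p_signal)
        false_alarm, _ = rates(cutoff)
        best = expected_value(cutoff, p_signal)
        never = expected_value(float("inf"), p_signal)  # pure strategy: never accept
        print(f"P(SN) = {p_signal:.2f}: optimum P_N(A) = {false_alarm:.4f}, "
              f"expected value = {best:.4f}, pure strategy = {never:.4f}")
```

With P(SN) = 0.10 the optimum false-alarm rate comes out between 0.003 and 0.004, and the pure strategy of never accepting gives an expected value within a fraction of a percent of the optimum, in line with the argument of the appendix.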

Received  June  23,  1960. 


SOME  COMMENTS  AND  A  CORRECTION  OF 
"PSYCHOACOUSTICS  AND  DETECTION  THEORY"* 

David  M.  Green 

Department of Economics and Research Laboratory of Electronics,
Massachusetts Institute of Technology, Cambridge, Massachusetts

Dr. S. S. Stevens has very kindly pointed out two items in my paper, "Psychoacoustics and Detection Theory,"¹ that require further comment in order to avoid
misunderstanding.

I  called  the  function  relating  the  percentage  of  correct  detection  responses  to  the 
physical  intensity  of  the  stimulus  the  psychophysical  function.  It  is  true  that  this 
function  is  more  often  called  the  psychometric  function,  a  term  probably  introduced 
by Urban in 1908.²

Originally  Fechner  added  up  successive  just-noticeable-differences  (jnd's)  to 
determine  the  relation  between  the  magnitude  of  sensation  and  the  physical  intensity 
of  the  stimulus.  The  resulting  relation  is  commonly  called  the  psychophysical  function. 
Since  Fechner's  time  many  other  techniques  for  determining  this  relation  have  been 
devised  and  the  results  are  also  called  psychophysical  functions  (e.g.,  Stevens'  power 
law³). The newer methods do not involve determining jnd's and are not obtained by
using  any  simple  variant  of  the  classical  methods  of  psychophysics.  We  are  therefore 
faced  with  the  anomaly  that  psychometric  functions  are  obtained  by  using  psycho- 
physical methods  and  psychophysical  functions  are  now  determined  by  other,  different 
techniques. 

Personally  I  find  the  designation  used  in  vision — frequency-of-seeing  curve — 
even  more  distasteful  than  the  term  psychometric  function.  Some  change  in  ter- 
minology would  be  most  welcome.   I  am  open  for  suggestions. 

The  second  item  is  more  crucial  and  concerns  my  remarks  about  the  neural- 
quantum  theory.  I  asserted  that  data  that  appear  to  indicate  a  two-quantum  observer 
when  plotted  against  pressure  units  cannot  be  interpreted  as  any  kind  of  quantum 
observer  when  plotted  against  energy  units.  There  is,  however,  a  very  straightforward 
interpretation  of  the  scales  of  pressure  and  energy  that  makes  this  assertion  incorrect. 
Unfortunately,  this  interpretation  had  never  occurred  to  me,  and  I  thereby  did  injustice 
to  the  authors  of  the  neural-quantum  theory.  Let  me  explain  this  interpretation  and 
the  scale  of  pressure  and  energy  units  that  I  had  in  mind  when  I  made  my  remarks. 

In  the  neural-quantum  procedure  we  have  a  continuous  sinusoidal  stimulus 
(call  it  the  standard).  At  specific  times  we  increase  briefly  the  amplitude  of  this  sinusoid 
and  the  observer's  task  is  to  detect  these  increments.  If  we  measure  the  pressure  of  the 
standard, call it $p$, and measure the pressure of the standard plus the increment, call it

From J. Acoust. Soc. Amer., 1961, 33, 965. Reprinted with permission.

*  The  preparation  of  this  letter  was  supported  by  the  U.S.  Army  Signal  Corps,  the  Air 
Force  (Operational  Applications  Office  and  Office  of  Scientific  Research),  and  the  Office  of 
Naval  Research.   This  is  Technical  Note  No.  ESD  TN  61-56. 

1 D. M. Green, J. Acoust. Soc. Am., 1960, 32, 1189.

2 F. M. Urban, The application of statistical methods to the problems of psychophysics,
Philadelphia: Psychological Clinic Press, 1908, p. 107.

3  S.  S.  Stevens,  Psychol.  Rev.,  1957,  64,  153. 
$p + \Delta p$, then by subtracting the former from the latter we obtain on a pressure scale
values of $\Delta p$. We may call this quantity $\Delta p$ the increment of pressure.

Similarly, if we measure the power of the standard, a quantity proportional to
$p^2$, and the power of the standard plus the increment, a quantity proportional to
$(p + \Delta p)^2$, we might subtract the former from the latter, and (since the constants of
proportionality are the same) obtain the quantity $(p^2 + 2\,\Delta p\,p + \Delta p^2 - p^2) = (2\,\Delta p\,p + \Delta p^2)$. The latter quantity is also proportional to energy, since the increment
is of constant duration, and we may call this quantity the increment of energy. The
important result is that these two quantities, the increment in pressure and the increment
in energy, are nearly linearly related for values of $\Delta p$ much less than $p$. If some data are exactly
consistent with the predictions of the neural quantum theory on one scale, they would
very nearly be consistent on the other scale.

When  I  made  my  remarks,  I  had  in  mind  data  plotted  on  a  scale  of  signal 
pressure  or  signal  energy.  By  signal  I  mean  the  waveform  added  to  the  standard  that 
the observers are asked to detect. In this terminology, the pressure of the signal is
proportional to $\Delta p$ and the energy of the signal is proportional to that quantity squared,
$\Delta p^2$. Only data plotted on a scale of signal pressure as I have now defined it are in
agreement  with  the  predictions  of  neural  quantum  theory. 

Part  of  the  reason  for  my  oversight  undoubtedly  arose  from  the  fact  that  this 
measure of signal energy $\Delta p^2$ is the quantity I used in presenting some of the data
reported  later  in  my  paper.  There  is,  however,  no  inherent  reason  for  using  my  partic- 
ular measure  of  the  stimulus  and  I  should  have  made  my  reference  clear. 

In  some  cases  the  two  different  scales  of  energy  obtained  from  the  pressure  scale 
would be exactly the same. This would happen if the standard and signal are incoherent; that is, if the middle term in the square of $(\Delta p + p)$ is zero. An example of
this  would  be  an  increment  in  white  noise.  In  the  case  at  hand,  this  is  not  true  and  the 
quantity  that  I  have  called  increment  in  energy  and  the  quantity  that  I  called  signal 
energy  are  quite  different. 
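A small numerical sketch may make the distinction between the two energy scales concrete. The standard pressure of 1.0 and the increments used below are arbitrary, hypothetical values, and the function names are introduced only for illustration; the common constant of proportionality is dropped.

```python
def increment_of_energy(p, dp):
    """(p + dp)**2 - p**2, i.e. 2*dp*p + dp**2, for a coherent increment."""
    return 2.0 * dp * p + dp ** 2


def signal_energy(dp):
    """Energy of the added waveform alone, proportional to dp**2."""
    return dp ** 2


if __name__ == "__main__":
    p = 1.0  # hypothetical pressure of the standard, arbitrary units
    for dp in (0.01, 0.02, 0.04):
        print(f"dp = {dp:.2f}: increment of energy = {increment_of_energy(p, dp):.4f}, "
              f"signal energy = {signal_energy(dp):.4f}")
```

Doubling a small increment of pressure roughly doubles the increment of energy but quadruples the signal energy, so data that are linear on the first two scales cannot also be linear on the third.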

The  general  point  I  was  trying  to  make  is  that  the  neural-quantum  theory  does 
not  specify  in  advance  how  the  physical  stimulus  should  be  measured.  It  was  my 
position  that  it  is  important  for  a  theory  of  psychophysics  to  specify  how  the  physical 
scale  is  related  to  the  expected  psychological  results.  This  position  is  apparently  not 
widely  endorsed.  I  am  particularly  impressed  with  the  number  of  theories  that  suggest 
that  the  psychometric  function  is  Gaussian,  log-Gaussian,  Poisson,  rectilinear,  or 
logistic,  but  cannot  specify  in  advance  what  particular  transformation  of  the  physical 
scale  will  yield  these  results.  It  is  not  hard  to  envision  different  circumstances  in  which 
all  these  assertions  are  true  at  least  in  the  sense  that  deviations  are  within  the  range  of 
experimental  error.  Somehow  there  never  seems  to  be  any  resolution  to  these  different 
findings. 

One  can,  of  course,  simply  ignore  all  this  and  go  on  measuring  only  one  arbitrary 
parameter  of  the  psychometric  function  such  as  the  "threshold"  value.  While  this 
position  obviously  has  the  merit  of  convenience,  it  would  also  appear  important  to 
demonstrate  how  all  of  these  different  results  might  come  about  from  one  single  general 
theory.  To  accomplish  the  latter  task  one  must  have  a  theory  which  carefully  specifies 
the  physical  part  of  the  psychophysical  theory. 
This paper is concerned with the
century-old  effort  to  determine  the 
functional  relations  that  hold  between 
subjective  continua  and  the  physical 
continua  that  are  presumed  to  underlie 
them.  The  first,  and  easily  the  most 
influential,  attempt  to  specify  the  pos- 
sible relations  was  made  by  Fechner. 
It  rests  upon  empirical  knowledge  of 
how  discrimination  varies  with  inten- 
sity along  the  physical  continuum  and 
upon  the  assumption  that  jnd's  are 
subjectively  equal  throughout  the  con- 
tinuum. When,  for  example,  discrimi- 
nation is  proportional  to  intensity 
(Weber's  law),  Fechner  claimed  that 
the equal-jnd assumption leads to a
logarithmic  relation   (Fechner's  law). 

This  idea  has  always  been  subject 
to  controversy,  but  recent  attacks  upon 
it  have  been  particularly  severe.  At 
the  theoretical  level,  Luce  and  Edwards 

^  This  work  has  been  supported  in  part 
by  Grant  M-2293  from  the  National  Institute 
of  Mental  Health  and  in  part  by  Grant 
NSF-G  5544  from  the  National  Science 
this version. I am particularly indebted to
S.  S.  Stevens  for  his  very  detailed  substan- 
tive and  stylistic  criticisms  of  the  last  two 
drafts. 


(1958) have pointed out that Fechner's
mathematical  reasoning  was  not  sound. 
Among  other  things,  his  assumption  is 
not  sufficient  to  generate  an  interval 
scale.  By  recasting  his  problem  some- 
what— essentially  by  replacing  the 
equal-jnd assumption with the somewhat
stronger condition that "equally
often  noticed  differences  are  equal,  ex- 
cept when  always  or  never  noticed" — 
they  were  able  to  show  that  an  interval 
scale  results,  and  to  present  a  mathe- 
matical expression  for  it.  Their  work 
has  no  practical  import  when  Weber's 
law, or its linear generalization $\Delta x = ax + b$,
is true, because the logarithm
is still the solution, but their jnd
scale  differs  from  Fechner's  integral 
when  Weber's  law  is  replaced  by  some 
other  function  relating  stimulus  jnd's 
to  intensity. 

At  the  empirical  level,  Stevens 
(1956,  1957)  has  argued  that  jnd's  are 
unequal  in  subjective  size  on  intensive, 
or  what  he  calls  prothetic,  continua — a 
contention  supported  by  considerable 
data — and  that  the  relation  between  the 
subjective  and  physical  continua  is  the 
power function $\alpha x^{\beta}$, not the logarithm.
Using  such  "direct"  methods  as  mag- 
nitude estimation  and  ratio  production, 
he  and  others  (Stevens:  1956,  1957; 
Stevens & Galanter, 1957) have accumulated
considerable evidence to buttress
the empirical generality of the
power  function.  Were  it  not  for  the 
fact  that  some  psychophysicists  are  un- 
easy about  these  methods,  which  seem 
to  rest  heavily  upon  our  experience 
with  the  number  system,  the  point 
would  seem  to  be  established.  In  an 
effort  to  bypass  these  objections,  Ste- 
vens (1959)  has  recently  had  subjects 
match  values  between  pairs  of  con- 
tinua,  and  he  finds  that  the  resulting 
relations  are  power  functions  whose 
exponents  can  be  predicted  from  the 
magnitude  scales  of  the  separate  vari- 
ables. Thus,  although  much  remains 
to  be  learned  about  the  "direct"  meth- 
ods of  scaling,  the  resulting  power 
functions  appear  to  summarize  an  in- 
teresting body  of  data. 

Given  these  empirical  results,  one  is 
challenged  to  develop  a  suitable  formal 
theory  from  which  they  can  be  shown 
to  follow.  There  can  be  little  doubt 
that,  as  a  starting  point,  certain  com- 
monly made  assumptions  are  inappro- 
priate: equality  of  jnd's,  equally  often 
noticed  differences,  and  Thurstone's 
equal  variance  assumption.  Since, 
however,  differences  stand  in  the  same 
— logarithmic — relation  to  ratios  as 
Fechner's  law  does  to  the  power  func- 
tion, a  reasonable  starting  point  might 
seem  to  be  the  assumption  that  the 
subjective  ratio  of  stimuli  one  jnd  apart 
is  a  constant  independent  of  the  stimu- 
lus intensity.  Obvious  as  the  proce- 
dure may  seem,  in  my  opinion  it  will 
not  do.  Although  generations  of  psy- 
chologists have  managed  to  convince 
themselves that the equal-jnd assumption
is plausible, if not obvious, it is
not  and  never  has  been  particularly 
compelling;  and  in  this  respect,  an 
equal-ratio  assumption  is  not  much  dif- 
ferent. This  is  not  to  deny  that  sub- 
jective continua  may  have  the  equal- 
ratio  property — they  must  if  the  power 
law  is  correct  and  Weber's  law  holds — 
but rather to argue that such an assumption
is too special to be acceptable
as a basic axiom in a deductive theory.
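The remark that the equal-ratio property must hold when the power law and Weber's law are both true can be verified directly. In the sketch below the constants `ALPHA`, `BETA`, and `WEBER_FRACTION` are hypothetical values chosen only for illustration, and Weber's law is taken in its simple form, with one jnd equal to a fixed fraction of the intensity.

```python
from math import log

ALPHA = 1.0            # hypothetical scale constant of the power function
BETA = 0.3             # hypothetical exponent
WEBER_FRACTION = 0.1   # hypothetical Weber fraction


def power_scale(x):
    """Subjective magnitude under the power law."""
    return ALPHA * x ** BETA


def log_scale(x):
    """Subjective magnitude under Fechner's logarithmic law."""
    return log(x)


def one_jnd_up(x):
    """Intensity one jnd above x when Weber's law holds."""
    return x * (1.0 + WEBER_FRACTION)


intensities = [1.0, 10.0, 100.0, 1000.0]
# Under the power law the subjective RATIO across one jnd is constant.
ratios = [power_scale(one_jnd_up(x)) / power_scale(x) for x in intensities]
# Under the logarithmic law the subjective DIFFERENCE across one jnd is constant.
differences = [log_scale(one_jnd_up(x)) - log_scale(x) for x in intensities]

if __name__ == "__main__":
    print("ratios across one jnd:     ", ratios)
    print("differences across one jnd:", differences)
```

Every ratio equals (1 + k) raised to the power β, whatever the intensity, which is the sense in which differences stand to ratios as Fechner's law stands to the power function.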

Elsewhere  (Luce,  in  press),  I  have 
suggested  another  approach.  An  ax- 
iom, or  possible  law,  of  wide  applica- 
bility in  the  study  of  choice  behavior, 
may  be  taken  in  conjunction  with  the 
linear  generalization  of  Weber's  law  to 
demonstrate  the  existence  of  a  scale 
that  is  a  power  function  of  the  physical 
continuum.  Although  that  theory 
leads  to  what  appears  to  be  the  correct 
form,  it  is  open  to  two  criticisms. 
First,  the  exponent  predicted  from  dis- 
crimination data  is  at  least  an  order 
of  magnitude  larger  than  that  obtained 
by  direct  scaling  methods.  Second,  the 
theory  is  based  upon  assumptions 
about  discriminability,  and  these  are 
not  obviously  relevant  to  a  scale  deter- 
mined by  another  method.  Scales  of 
apparent  magnitude  may  be  related  to 
jnd scales, but it would be unwise to
take  it  for  granted  that  they  are. 

The  purpose  of  this  paper  is  to  out- 
line still  another  approach  to  the  prob- 
lem, one  that  is  not  subject  to  the  last 
criticism.  The  results  have  applicabil- 
ity far  beyond  the  bounds  of  psycho- 
physics,  for  they  concern  the  general 
question  of  the  relation  between  meas- 
urement and  substantive  theories. 

Types  of  Scales 

Although  familiarity  may  by  now 
have  dulled  our  sense  of  its  importance, 
Stevens'  (1946,  1951)  stress  upon  the 
transformation  groups  that  leave  cer- 
tain specified  scale  properties  invariant 
must,  I  think,  be  considered  one  of  the 
more  striking  contributions  to  the  dis- 
cussion of  measurement  in  the  past  few 
decades.  Prior  to  his  work,  most  writ- 
ers had  put  extreme  emphasis  upon  the 
property  of  "additivity,"  which  is  a 
characteristic  of  much  physical  meas- 
urement (Cohen  &  Nagel,  1934).  It 
was held that this property is fundamental
to scientific measurement and,
indeed, the term "fundamental measurement"
was applied only to these
scales.  This  contention,  however,  puts 
the  nonphysical  sciences  in  a  most  pe- 
culiar fix.  Since  no  one  has  yet  dis- 
covered an  "additive"  psychological 
variable,  it  would  seem  that  psychology 
can  have  no  fundamental  measures  of 
its own. This conclusion might be
acceptable  if  we  could  define  psycho- 
logical measures  in  terms  of  the  funda- 
mental physical  scales,  i.e.,  as  "derived" 
scales,  but  few  of  the  things  we  want 
to  measure  seem  to  be  definable  in  this 
way.  So  either  rigorous  psychological 
measurement  must  be  considered  im- 
possible or  additive  empirical  opera- 
tions must  not  be  considered  essential 
to  measurement.  What  is  important 
is  not  additivity  itself,  but  the  fact  that, 
when  it  is  coupled  with  other  plausible 
assumptions,  it  sharply  restricts  the 
class  of  transformations  that  may  be 
applied  to  the  resulting  scale.  Spe- 
cifically, it  makes  the  scale  unique  ex- 
cept for  multiplication  by  positive  con- 
stants, i.e.,  changes  of  unit.  Additivity 
is  not  the  only  property  that  an  assign- 
ment of  numbers  to  objects  or  events 
may  have  which  sharply  limits  the 
admissible  transformations.  Some  of 
these  other  properties  appear  applicable 
and  relevant  to  psychological  variables, 
and  so  in  this  sense  psychological 
measurement  appears  to  be  possible. 

By  a  theory  of  measurement,  I  shall 
mean  the  following.  One  or  more  op- 
erations and  relations  are  specified  over 
a  set  of  objects  or  events  (the  variable), 
and  they  are  characterized  by  a  num- 
ber of  empirically  testable  assumptions. 
In  addition,  it  must  be  possible  to  as- 
sign numbers  to  the  objects  and  to 
identify  numerical  operations  and  rela- 
tions with  the  empirical  operations  and 
relations  in  such  a  way  that  the  nu- 
merical operations  represent  (are  iso- 
morphic to)  the  empirical  ones.  In 
other words, we have a measurement
theory whenever (a) we have a system
of  rules  for  assigning  numerical  values 
to  objects  that  are  interrelated  by  as- 
sumptions about  certain  empirical  op- 
erations involving  them,  and  (b)  these 
rules  let  us  set  up  an  isomorphic  rela- 
tion between  some  properties  of  the 
number  system  and  some  aspects  of 
the  empirical  operations. 

One  of  the  simplest  examples  of  a 
theory  of  measurement  is  a  finite  set 
(of  goods)  ordered  by  a  binary  (pref- 
erence) relation  P  that  is  assumed  to 
be  antisymmetric  and  transitive.  A 
scale u can be assigned to the set in
such a manner that it represents P in
the sense that xPy if and only if $u(x) > u(y)$.

By  the  scale  type,  I  shall  mean  the 
group  of  transformations  that  result  in 
other  isomorphic  representations  of  the 
measurement  theory.  In  the  preceding 
example  any  strictly  monotonic  increas- 
ing transformation  will  do,  and  scales 
of  this  type  are  known  as  ordinal. 
Any  transformation  chosen  from  the 
scale  type  will  be  said  to  be  an  admissi- 
ble transformation. 
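
The ordinal example can be checked mechanically. The following is only a minimal sketch: the four goods, their ranking, and the helper `represents` are hypothetical choices made for illustration, not part of the theory. It builds one scale u for a strict preference relation P, then confirms that a strictly increasing transformation of u still represents P while a non-monotonic one does not.

```python
from itertools import permutations
import math

# Hypothetical goods, listed from most to least preferred.
ranking = ["tea", "coffee", "water", "milk"]

# P is the strict preference relation: (x, y) in P means "x is preferred to y".
P = {(x, y) for i, x in enumerate(ranking) for y in ranking[i + 1:]}

def represents(u, P, items):
    """True if xPy holds exactly when u[x] > u[y], for all distinct x, y."""
    return all(((x, y) in P) == (u[x] > u[y]) for x, y in permutations(items, 2))

# One possible scale: the number of goods to which each good is preferred.
u = {x: sum((x, y) in P for y in ranking) for x in ranking}

# A strictly increasing transformation (admissible for an ordinal scale).
u_mono = {x: math.exp(v) + v ** 3 for x, v in u.items()}

# A non-monotonic transformation (not admissible).
u_bad = {x: (v - 1.5) ** 2 for x, v in u.items()}

ordinal_ok = represents(u, P, ranking)
monotone_ok = represents(u_mono, P, ranking)
nonmonotone_ok = represents(u_bad, P, ranking)
```

Under these assumptions the first two checks succeed and the third fails, which is all that is meant by saying that the scale type of this theory is ordinal.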

It  should  be  re-emphasized  that  quite 
divergent  measurement  theories  may 
lead  to  the  same  scale  type.  For  ex- 
ample, Case  V  of  Thurstone's  law  of 
comparative  judgment  (1927)  and  the 
von Neumann-Morgenstern utility axioms
(1947) both result in interval
scales  (of  something),  yet  the  basic 
terms  and  assumptions  involved  are 
totally  different,  even  though  both 
theories  can  be  applied  to  the  same 
subject  matter.  Of  course,  the  result- 
ing interval  scales  may  not  be  linearly 
related,  for  they  may  be  measuring 
different  things. 

A  measurement  theory  may  be  con- 
trasted with  what  I  shall  call  a  sub- 
stantive theory.  The  former  involves 
operations  and  assumptions  only  about 
a  single  class  of  objects  which  is  treated 
as a unitary variable, whereas the latter
involves relations among two or
more  variables.  In  practice,  substan- 
tive theories  are  usually  stated  in  terms 
of functional relations among the scales
that  result  from  the  several  measure- 
ment theories  for  the  variables  involved. 

For  a  number  of  purposes,  the  scale 
type  is  much  more  crucial  than  the 
details  of  the  measurement  theory  from 
which  the  scale  is  derived.  For  exam- 
ple, much  attention  has  been  paid  to 
the  limitations  that  the  scale  type 
places  upon  the  statistics  one  may  sen- 
sibly employ.  If  the  interpretation  of 
a  particular  statistic  or  statistical  test 
is  altered  when  admissible  scale  trans- 
formations are  applied,  then  our  sub- 
stantive conclusions  will  depend  upon 
which  arbitrary  representation  of  the 
scale  we  have  used  in  making  our  cal- 
culations. Most  scientists,  when  they 
understand  the  problem,  feel  that  they 
should  shun  such  statistics  and  rely 
only  upon  those  that  exhibit  the  ap- 
propriate invariances  for  the  scale  type 
at  hand.  Both  the  geometric  and  arith- 
metic means  are  legitimate  in  this  sense 
for  ratio  scales  (unit  arbitrary),  only 
the  latter  is  legitimate  for  interval 
scales  (unit  and  zero  arbitrary),  and 
neither  for  ordinal  scales.  For  fuller 
discussions,  see  Stevens:  1946,  1951, 
1955 ;  for  a  somewhat  less  strict  inter- 
pretation of  the  conclusions,  see  Mos- 
teller,  1958. 
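
The invariance claim about means can be illustrated numerically. The sketch below uses made-up data and hypothetical transformation constants (2.54 and 100.0); it asks only whether the conclusion "group A exceeds group B" survives each kind of transformation.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def a_exceeds_b(stat, transform, a, b):
    """Does the statistic of group a exceed that of group b after the transformation?"""
    return stat([transform(x) for x in a]) > stat([transform(x) for x in b])

# Made-up positive observations for two groups.
group_a = [1.0, 9.0]
group_b = [4.0, 5.0]

def identity(x):
    return x

def change_unit(x):            # admissible for ratio and interval scales
    return 2.54 * x

def change_unit_and_zero(x):   # admissible for interval scales only
    return 2.54 * x + 100.0

def monotone_only(x):          # admissible for ordinal scales only
    return math.sqrt(x)

transforms = [identity, change_unit, change_unit_and_zero, monotone_only]
am = {t.__name__: a_exceeds_b(arithmetic_mean, t, group_a, group_b) for t in transforms}
gm = {t.__name__: a_exceeds_b(geometric_mean, t, group_a, group_b) for t in transforms}
```

With these data the arithmetic-mean comparison is unchanged by a change of unit or of unit and zero but reverses under the square-root transformation, and the geometric-mean comparison is unchanged by a change of unit but reverses when the zero is also moved.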

A  second  place  where  the  transfor- 
mation group  imposes  limitations  is  in 
the  construction  of  substantive  theories. 
These  limitations  seem  to  have  received 
far  less  attention  than  the  statistical 
questions,  even  though  they  are  un- 
doubtedly more  fundamental.  The  re- 
mainder of  the  paper  will  attempt  to 
formulate  the  relation  between  scale 
types  and  functional  laws,  and  to  an- 
swer the  question  what  psychophysical 
laws  are  possible.  As  already  pointed 
out,  these  issues  have  scientific  rele- 
vance beyond  psychophysics. 


A  Principle  of  Theory 
Construction 

In  physics  one  finds  at  least  two 
classes  of  basic  assumptions :  specific 
empirical  laws,  such  as  the  universal 
law  of  gravitation  or  Ohm's  law,  and 
a  priori  principles  of  theory  construc- 
tion, such  as  the  requirement  that  the 
laws  of  mechanics  should  be  invariant 
under  uniform  translations  and  rota- 
tions of  the  coordinate  system.  Other 
laws,  such  as  the  conservation  of  en- 
ergy, seem  to  have  changed  from  the 
empirical  to  the  a  priori  category  dur- 
ing the  development  of  physics.  In 
psychology  more  stress  has  been  put 
on  the  discovery  of  empirical  laws  than 
on  the  formulation  of  guiding  princi- 
ples, and  the  search  for  empirical  rela- 
tions tends  to  be  pursued  without  the 
benefit  of  explicit  statements  about 
what is and is not an acceptable theory.²
Since such principles have been
used  effectively  in  physics  to  limit  the 
possible  physical  laws,  one  wonders 
whether  something  similar  may  not  be 
possible  in  psychology. 

Without  such  principles,  practically 
any  relation  is  a  priori  possible,  and 
the  correct  one  is  difficult  to  pin  down 
by  empirical  means  because  of  the  ever 
present  errors  of  observation.  The 
error  problem  is  particularly  acute  in 
the  behavioral  sciences.  On  the  other 
hand,  if  a  priori  consideration  about 
what  constitutes  an  acceptable  theory 
limits  us  to  some  rather  small  set  of 
possible laws, then fairly crude observations
may sometimes suffice to decide
which law actually obtains.

2 Two attempts to introduce and use such
statements in behavioral problems are the
combining of classes condition in stochastic
learning theory (Bush, Mosteller, & Thompson,
1954) and some work on the form of
the utility function for money which is based
upon the demand that certain game theory
solutions should remain unchanged when a
constant sum of money is added to all the
payoffs (Kemeny & Thompson, 1957). In
neither case do the conditions seem particularly
compelling.

The  principle  to  be  suggested  appears 
to  be  a  generalization  of  one  used  in 
physics.     It  may  be  stated  as  follows. 

A  substantive  theory  relating  two 
or  more  variables  and  the  meas- 
urement theories  for  these  varia- 
bles should  be  such  that: 

1.  (Consistency  of  substantive 
and  measurement  theories)  Admis- 
sible transformations  of  one  or 
more  of  the  independent  variables 
shall  lead,  via  the  substantive  the- 
ory, only  to  admissible  transfor- 
mations of  the  dependent  variables. 

2.  (Invariance  of  the  substan- 
tive theory)  Except  for  the  nu- 
merical values  of  parameters  that 
reflect  the  effect  on  the  dependent 
variables  of  admissible  transfor- 
mations of  the  independent  vari- 
ables, the  mathematical  structure 
of  the  substantive  theory  shall  be 
independent  of  admissible  trans- 
formations of  the  independent 
variables. 

In  this  principle,  and  in  what  fol- 
lows, the  terms  independent  and  de- 
pendent variables  are  used  only  to 
distinguish  the  variables  to  which  arbi- 
trary, admissible  transformations  are 
imposed  from  those  for  which  the 
transformations  are  determined  by  the 
substantive  theory.  As  will  be  seen, 
in  some  cases  the  labeling  is  truly  arbi- 
trary in  the  sense  that  the  substantive 
theory  can  be  written  so  that  any  vari- 
able appears  either  in  the  dependent 
or  independent  role,  but  in  other  cases 
there  is  a  true  asymmetry  in  the  sense 
that  some  variables  must  be  dependent 
and  others  independent  if  any  substan- 
tive theory  relates  them  at  all. 

One  can  hardly  question  the  con- 
sistency part  of  the  principle.  If  an 
admissible  transformation  of  an  inde- 
pendent variable  leads  to  an  inadmissi- 


ble transformation  of  a  dependent  vari- 
able, then  one  is  simply  saying  that  the 
strictures  imposed  by  the  measurement 
theories  are  incompatible  with  those 
imposed  by  the  substantive  theory. 
Such  a  logical  inconsistency  must,  I 
think,  be  interpreted  as  meaning  that 
something  is  amiss  in  the  total  theo- 
retical structure. 

The  invariance  part  is  more  subtle 
and  controversial.  It  asserts  that  we 
should  be  able  to  state  the  substantive 
laws  of  the  field  without  reference  to 
the  particular  scales  that  are  used  to 
measure  the  variables.  For  example, 
we  want  to  be  able  to  say  that  Ohm's 
law  states  that  voltage  is  proportional 
to  the  product  of  resistance  and  current 
without  specifying  the  units  that  are 
used  to  measure  voltage,  resistance,  or 
current.  Put  another  way,  we  do  not 
want  to  have  one  law  when  one  set  of 
units  is  used  and  another  when  a  differ- 
ent set  of  units  is  used.  Although  this 
seems  plausible,  there  are  examples 
from  physics  that  can  be  viewed  as  a 
particular sort of violation of Part 2;
however,  let  us  postpone  the  discussion 
of  these  until  some  consequences  of  the 
principle  as  stated  have  been  derived. 

The  meaning  of  the  principle  may 
be  clarified  by  examples  that  violate  it. 
Suppose  it  is  claimed  that  two  ratio 
scales  are  related  by  a  logarithmic  law. 
An  admissible  transformation  of  the 
independent  variable  x  is  multiplication 
by  a  positive  constant  k,  i.e.,  a  change 
of unit. However, the fact that
$\log kx = \log k + \log x$ means that an
inadmissible transformation, namely, a
change  of  zero,  is  effected  on  the  de- 
pendent variable.  Hence,  the  loga- 
rithm fails  to  meet  the  consistency 
requirement. Next, consider an exponential
law; then the transformation
leads to $e^{kx} = (e^{x})^{k}$. This can be
viewed  either  as  a  violation  of  con- 
sistency or  of  invariance.  If  the  law 
is exponential, then the dependent variable
is raised to a power, which is
inconsistent with its being a ratio scale.
Alternatively,  the  dependent  variable 
may  be  taken  to  be  a  ratio  scale,  but 
then  the  law  is  not  invariant  because 
it  is  an  exponential  raised  to  a  power 
that  depends  upon  the  unit  of  the  inde- 
pendent variable. 
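
The two violations can be seen numerically. In the following sketch the particular power law, the sample points, and the unit change k = 3 are arbitrary assumptions; it checks what a change of unit in x does to the dependent variable under a power, a logarithmic, and an exponential law.

```python
import math

def is_constant(values, rel_tol=1e-9):
    """True if all values agree with the first one to within rel_tol."""
    return all(math.isclose(v, values[0], rel_tol=rel_tol) for v in values)

def power_law(x):
    # Hypothetical power law with arbitrary constants.
    return 2.0 * x ** 0.5

xs = [2.0, 5.0, 10.0, 40.0]   # arbitrary values of the independent variable
k = 3.0                       # change of unit: x -> k * x

# Power law: u(kx) / u(x) does not depend on x, so u merely changes unit.
power_ratios = [power_law(k * x) / power_law(x) for x in xs]

# Logarithmic law: the ratio depends on x, but the difference is constant,
# i.e., the dependent variable undergoes a change of zero.
log_ratios = [math.log(k * x) / math.log(x) for x in xs]
log_shifts = [math.log(k * x) - math.log(x) for x in xs]

# Exponential law: the ratio depends on x; the dependent variable is instead
# raised to the power k, since exp(kx) = exp(x) ** k.
exp_ratios = [math.exp(k * x) / math.exp(x) for x in xs]
exp_powers = [math.log(math.exp(k * x)) / math.log(math.exp(x)) for x in xs]
```

Only the power law turns a change of unit in x into a change of unit in u; the logarithm produces a shift of zero and the exponential a power transformation, neither of which is admissible for a ratio scale.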

An Application of the Principle

Most  of  the  physical  measures  en- 
tering into  psychophysics  are  idealized 
in  physical  theories  in  such  a  way  that 
they  form  either  ratio  or  interval 
scales.  Mass,  length,  pressure,  and 
time  durations  are  measured  on  ratio 
scales,  and  physical  time  (not  time 
durations),  ordinary  temperature,  and 
entropy  are  measured  on  interval 
scales.  Of  course,  differences  and  de- 
rivatives of  interval  scale  values  con- 
stitute ratio  scales. 

Although  most  psychological  scales 
in  current  use  can  at  best  be  con- 
sidered to  be  ordinal,  those  who  have 
worked  on  psychological  measurement 
theories  have  attempted  to  arrive  at 
scales  that  are  either  ratio  or  interval, 
preferably  the  former.  Examples: 
the  equally  often  noticed  difference 
assumption  and  the  closely  related 
Case  V  of  Thurstone's  "law  of  com- 
parative judgment"  lead  to  interval 
scales;  Stevens  has  argued  that  mag- 
nitude estimation  methods  result  in 
ratio  scales  (but  no  measurement  the- 
ory has  been  offered  in  support  of  this 
plausible  belief) ;  and  I  have  given  suf- 
ficient conditions  to  derive  a  ratio 
scale  from  discrimination  data.  Our 
question  here,  however,  is  not  how 
well  psychologists  have  succeeded  in 
perfecting  scales  of  one  type  or  an- 
other, but  what  a  knowledge  of  scale 
types  can  tell  us  about  the  relations 
among  scales. 

In  addition  to  these  two  common 
types  of  scales,  there  is  some  interest 
in  what  have  been  called  logarithmic 


interval  scales  (Stevens,  1957).  In  this 
case  the  admissible  transformations 
are multiplications by positive constants
and raising to positive powers,
i.e., $kx^{c}$, where $k > 0$ and $c > 0$. The
name applied to this scale type reflects
the fact that $\log x$ is an interval
scale, since the transformed scale goes
into $c \log x + \log k$. We will consider
all  combinations  of  ratio,  interval,  and 
logarithmic  interval  scales. 

Because  this  topic  is  more  general 
than  psychophysics,  I  shall  refer  to 
the  variables  as  independent  and  de- 
pendent rather  than  physical  and  psy- 
chological. Both  variables  will  be 
assumed  to  form  numerical  continua 
having more than one point. Let
$x > 0$ denote a typical value of the
independent variable and $u(x) > 0$
the corresponding value of the dependent
variable, where $u$ is the unknown
functional law relating them.
Suppose,  first,  that  both  variables 
form  ratio  scales.  If  the  unit  of  the 
independent  variable  is  changed  by 
multiplying  all  values  by  a  positive 
constant  k,  then  according  to  the 
principle  stated  above  only  an  ad- 
missible transformation  of  the  de- 
pendent variable,  namely  multiplica- 
tion by  a  positive  constant,  should  re- 
sult and  the  form  of  the  functional  law 
should  be  unaffected.  That  is  to  say, 
the  changed  unit  of  the  dependent 
variable  may  depend  upon  k,  but  it 
shall  not  depend  upon  x,  so  we  denote 
it by $K(k)$. Casting this into mathematical
terms, we obtain the functional
equation

$$u(kx) = K(k)\,u(x)$$

where $k > 0$ and $K(k) > 0$.

Functional  equations  for  the  other 
cases  are  arrived  at  in  a  similar  man- 
ner.    They  are  summarized  in  Table  1 . 

The question is: What do these nine
functional equations, each of which
embodies the principle, imply about
the form of $u$? We shall limit our
consideration to theories where $u$ is
a continuous, nonconstant function
of $x$.

TABLE 1

The Functional Equations for the Laws Satisfying the
Principle of Theory Construction

Eq.   Scale Types                      Functional Equation                      Comments
No.   Independent     Dependent
      Variable        Variable

1     ratio           ratio            $u(kx) = K(k)u(x)$                       $k>0$, $K(k)>0$
2     ratio           interval         $u(kx) = K(k)u(x) + C(k)$                $k>0$, $K(k)>0$
3     ratio           log interval     $u(kx) = K(k)u(x)^{C(k)}$                $k>0$, $K(k)>0$, $C(k)>0$
4     interval        ratio            $u(kx+c) = K(k,c)u(x)$                   $k>0$, $K(k,c)>0$
5     interval        interval         $u(kx+c) = K(k,c)u(x) + C(k,c)$          $k>0$, $K(k,c)>0$
6     interval        log interval     $u(kx+c) = K(k,c)u(x)^{C(k,c)}$          $k>0$, $K(k,c)>0$, $C(k,c)>0$
7     log interval    ratio            $u(kx^{c}) = K(k,c)u(x)$                 $k>0$, $c>0$, $K(k,c)>0$
8     log interval    interval         $u(kx^{c}) = K(k,c)u(x) + C(k,c)$        $k>0$, $c>0$, $K(k,c)>0$
9     log interval    log interval     $u(kx^{c}) = K(k,c)u(x)^{C(k,c)}$        $k>0$, $c>0$, $K(k,c)>0$, $C(k,c)>0$

Theorem 1. If the independent and
dependent continua are both ratio scales,
then $u(x) = \alpha x^{\beta}$, where $\beta$ is independent
of the units of both variables.³

3 In this and in the following theorems,
the statement can be made more general if
$x$ is replaced by $x + \gamma$, where $\gamma$ is a constant
independent of $x$ but having the same unit as
$x$. The effect of this is to place the zero of
$u$ at some point different from the zero of $x$.
In psychophysics the constant $\gamma$ may be regarded
as the threshold. The presence of
such a constant means, of course, that a plot
of $\log u$ vs. $\log x$ will not in general be a
straight line. If, however, the independent
variable is measured in terms of deviations
from the threshold, the plot may become
straight. Such nonlinear plots have been
observed, and in at least some instances the
degree of curvature seems to be correlated
with the magnitude of the threshold. Further
empirical work is needed to see whether
this is a correct explanation of the curvature.

Proof. Set $x = 1$ in Equation 1, then
$u(k) = K(k)u(1)$. Because $u$ is nonconstant
we may choose $k$ so that
$u(k) > 0$, and because $K(k) > 0$, it
follows that $u(1) > 0$, so $K(k) = u(k)/u(1)$.
Thus, Equation 1 becomes
$u(kx) = u(k)u(x)/u(1)$. Let $v = \log[u/u(1)]$,
then

$$
\begin{aligned}
v(kx) &= \log[u(kx)/u(1)] \\
      &= \log\frac{u(k)u(x)}{u(1)u(1)} \\
      &= \log[u(k)/u(1)] + \log[u(x)/u(1)] \\
      &= v(k) + v(x)
\end{aligned}
$$

Since $u$ is continuous, so is $v$, and it is
well known that the only continuous
solutions to the last functional equation
are of the form

$$v(x) = \beta \log x = \log x^{\beta}$$

Thus,

$$u(x) = \alpha e^{v(x)} = \alpha x^{\beta}$$

where $\alpha = u(1)$.

We observe that since

$$u(kx) = \alpha k^{\beta} x^{\beta} = \alpha' x^{\beta}$$

$\beta$ is independent of the unit of $x$, and
it is clearly independent of the unit
of $u$.
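
As a numerical check on Theorem 1, the sketch below takes a power function with arbitrary, hypothetical constants and verifies that it satisfies Equation 1 with $K(k) = k^{\beta}$, and that the exponent recovered from a log-log plot does not depend on the unit of x.

```python
import math

alpha, beta = 1.7, 0.33   # hypothetical constants of the law u(x) = alpha * x**beta

def u(x):
    return alpha * x ** beta

def K(k):
    # The change of unit induced on u by the change of unit k on x.
    return k ** beta

# Equation 1: u(kx) = K(k) u(x) for every k > 0 and x > 0 that we try.
equation_1_holds = all(
    math.isclose(u(k * x), K(k) * u(x))
    for k in (0.5, 2.0, 1000.0)
    for x in (0.1, 1.0, 42.0)
)

def loglog_slope(f, x1, x2):
    """Slope of log f against log x between two points."""
    return (math.log(f(x2)) - math.log(f(x1))) / (math.log(x2) - math.log(x1))

# The exponent is the same whether x is expressed in the old or the new unit.
slope_original = loglog_slope(u, 1.0, 10.0)
slope_rescaled = loglog_slope(lambda x: u(1000.0 * x), 1.0, 10.0)
```

Both slopes equal the assumed exponent, whereas the multiplicative constant absorbs the change of unit.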

Theorem 2. If the independent continuum
is a ratio scale and the dependent
continuum an interval scale, then
either $u(x) = \alpha \log x + \beta$, where $\alpha$ is
independent of the unit of the independent
variable, or $u(x) = \alpha x^{\beta} + \delta$,
where $\beta$ is independent of the units of
both variables and $\delta$ is independent of
the unit of the independent variable.

Proof. In solving Equation 2, there
are two possibilities to consider.

1. If $K(k) \equiv 1$, then define $v = e^{u}$.
Equation 2 becomes $v(kx) = D(k)v(x)$,
where $D(k) = e^{C(k)} > 0$ and $v$ is continuous,
positive, and nonconstant because
$u$ is. By Theorem 1, $v(x) = \delta x^{\alpha}$,
where $\alpha$ is independent of the unit of
$x$ and where $\delta > 0$ because, by definition,
$v > 0$. Taking logarithms,
$u(x) = \alpha \log x + \beta$, where $\beta = \log \delta$.

2. If $K(k) \not\equiv 1$, then let $u$ and $u^{*}$
be two different solutions to the problem,
and define $w = u^{*} - u$. It follows
immediately from Equation 2
that $w$ must satisfy the functional
equation $w(kx) = K(k)w(x)$. Since
both $u$ and $u^{*}$ are continuous, so is $w$;
however, it may be a constant. Since
$K(k) \not\equiv 1$, it is clear that the only
constant solution is $w = 0$, and this is
impossible since $u$ and $u^{*}$ were chosen
to be different. Thus, by Theorem 1,
$w(x) = \alpha' x^{\beta}$. Substituting this into the
functional equation for $w$, it follows
that $K(k) = k^{\beta}$. Then setting $x = 0$
in Equation 2, we obtain
$C(k) = u(0)(1 - k^{\beta})$. We now observe that
$u(x) = \alpha x^{\beta} + \delta$, where $\delta = u(0)$, is a
solution to Equation 2:

$$
\begin{aligned}
u(kx) &= \alpha k^{\beta} x^{\beta} + \delta \\
      &= \alpha k^{\beta} x^{\beta} + u(0)k^{\beta} + u(0) - u(0)k^{\beta} \\
      &= k^{\beta} u(x) + u(0)(1 - k^{\beta}) \\
      &= K(k)u(x) + C(k)
\end{aligned}
$$

Any other solution is of the same form
because

$$
\begin{aligned}
u^{*}(x) &= u(x) + w(x) \\
         &= \alpha x^{\beta} + \delta + \alpha' x^{\beta} \\
         &= (\alpha + \alpha')x^{\beta} + \delta
\end{aligned}
$$

It is easy to see that $\delta$ is independent
of the unit of $x$ and $\beta$ is independent
of both units.

A much simpler proof of this theorem
can be given if we assume that $u$
is differentiable in addition to being
continuous. Since the derivative of an
interval scale is a ratio scale, it follows
immediately that $du/dx$ satisfies
Equation 1 and so, by Theorem 1,
$du/dx = \alpha x^{\beta}$. Integrating, we get

$$
u(x) =
\begin{cases}
\dfrac{\alpha}{\beta + 1}\, x^{\beta + 1} + \delta & \text{if } \beta \neq -1 \\[1ex]
\alpha \log x + \delta & \text{if } \beta = -1
\end{cases}
$$

Theorem 3. If the independent continuum
is a ratio scale and the dependent
continuum is a logarithmic interval
scale, then either $u(x) = \delta e^{\alpha x^{\beta}}$, where $\alpha$
is independent of the unit of the dependent
variable, $\beta$ is independent of the
units of both variables and $\delta$ is independent
of the unit of the independent
variable, or $u(x) = \alpha x^{\beta}$, where $\beta$ is
independent of the units of both variables.

Proof. Take the logarithm of Equation 3
and let $v = \log u$:

$$v(kx) = K^{*}(k) + C(k)v(x)$$

where $K^{*}(k) = \log K(k)$. By Theorem 2,
either

$$v(x) = \alpha x^{\beta} + \delta^{*} \quad\text{or}\quad v(x) = \beta \log x + \alpha^{*}$$

Taking exponentials, either

$$u(x) = \delta e^{\alpha x^{\beta}} \quad\text{or}\quad u(x) = \alpha x^{\beta}$$

where $\delta = e^{\delta^{*}}$ and, in the second equation,
$\alpha = e^{\alpha^{*}}$.

Theorem 4. If the independent continuum
is an interval scale, then it is
impossible for the dependent continuum
to be a ratio scale.

Proof. Let $c = 0$ in Equation 4, then
by Theorem 1 we know $u(x) = \alpha x^{\beta}$.
Now set $k = 1$ and $c \neq 0$ in Equation 4:

$$\alpha(x + c)^{\beta} = K(1,c)\,\alpha x^{\beta}$$

so

$$x + c = K(1,c)^{1/\beta}\,x$$

which implies $x$ is a constant, contrary
to our assumption that both
continua have more than one point.

Theorem 5. If the independent and
dependent continua are both interval
scales, then $u(x) = \alpha x + \beta$, where $\beta$ is
independent of the unit of the independent
variable.

Proof. If we let $c = 0$, then Equation 5
reduces to Equation 2 and so
Theorem 2 applies. If $u(x) = \alpha \log x + \beta$,
then choosing $k = 1$ and $c \neq 0$
in Equation 5 yields

$$\alpha \log (x + c) + \beta = K(1,c)\,\alpha \log x + K(1,c)\beta + C(1,c)$$

By taking the derivative with respect
to $x$, it is easy to see that $x$ must be
a constant, which is impossible.

Thus, we must conclude that
$u(x) = \alpha x^{\delta} + \beta$. Again, set $k = 1$ and
$c \neq 0$,

$$\alpha(x + c)^{\delta} + \beta = K(1,c)\,\alpha x^{\delta} + K(1,c)\beta + C(1,c)$$

If $\delta \neq 1$, then differentiate with respect
to $x$:

$$\alpha\delta(x + c)^{\delta - 1} = K(1,c)\,\alpha\delta x^{\delta - 1}$$

which implies $x$ is a constant, so we
must conclude $\delta = 1$. It is easy to see
that $u(x) = \alpha x + \beta$ satisfies Equation 5.
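
Theorem 5 can be illustrated in the same spirit. The sketch below assumes a linear law with arbitrary constants and the induced transformations K(k, c) = k and C(k, c) = αc + β(1 − k), which follow by direct substitution. The helper `is_affine_in_u` is a hypothetical device of my own: it asks whether a shift of zero in x acts on u as a straight-line map, which is what an admissible interval-scale transformation requires.

```python
import math

alpha, beta = 2.0, 3.0   # hypothetical linear law u(x) = alpha * x + beta

def u_linear(x):
    return alpha * x + beta

def K(k, c):
    return k

def C(k, c):
    return alpha * c + beta * (1.0 - k)

# Equation 5: u(kx + c) = K(k, c) u(x) + C(k, c).
linear_ok = all(
    math.isclose(u_linear(k * x + c), K(k, c) * u_linear(x) + C(k, c), abs_tol=1e-9)
    for k in (0.5, 1.8)
    for c in (-32.0, 0.0, 273.15)
    for x in (-10.0, 0.0, 37.0)
)

def is_affine_in_u(u, c, xs=(1.0, 2.0, 4.0)):
    """True if the points (u(x), u(x + c)) are collinear, i.e., if the shift
    x -> x + c acts on u as u -> K u + C for constants K and C."""
    (p1, q1), (p2, q2), (p3, q3) = [(u(x), u(x + c)) for x in xs]
    cross = (p2 - p1) * (q3 - q1) - (p3 - p1) * (q2 - q1)
    return math.isclose(cross, 0.0, abs_tol=1e-9)

linear_shift_ok = is_affine_in_u(u_linear, c=1.0)
log_shift_ok = is_affine_in_u(math.log, c=1.0)
```

The linear law passes both checks; the logarithmic law, which is acceptable when x is a ratio scale, fails as soon as the zero of x is allowed to move.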

Theorem 6. If the independent continuum
is an interval scale and the
dependent continuum is a logarithmic
interval scale, then $u(x) = \alpha e^{\beta x}$, where
$\alpha$ is independent of the unit of the independent
variable and $\beta$ is independent
of the unit of the dependent variable.

Proof. Take the logarithm of Equation 6
and let $v = \log u$:

$$v(kx + c) = K^{*}(k,c) + C(k,c)v(x)$$

where $K^{*}(k,c) = \log K(k,c)$. By
Theorem 5,

$$v(x) = \beta x + \alpha^{*}$$

so

$$u(x) = \alpha e^{\beta x}$$

where $\alpha = e^{\alpha^{*}}$.
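
For Theorem 6 a direct substitution suggests $K(k,c) = \alpha^{1-k} e^{\beta c}$ and $C(k,c) = k$; these expressions are my own working under the stated law, not formulas given in the text. The sketch checks them against Equation 6 for arbitrary constants.

```python
import math

alpha, beta = 2.5, 0.4   # hypothetical exponential law u(x) = alpha * exp(beta * x)

def u(x):
    return alpha * math.exp(beta * x)

def K(k, c):
    # Multiplicative part of the induced log-interval transformation (assumed form).
    return alpha ** (1.0 - k) * math.exp(beta * c)

def C(k, c):
    # Exponent of the induced log-interval transformation (assumed form).
    return k

# Equation 6: u(kx + c) = K(k, c) * u(x) ** C(k, c).
equation_6_holds = all(
    math.isclose(u(k * x + c), K(k, c) * u(x) ** C(k, c), rel_tol=1e-9)
    for k in (0.5, 1.0, 3.0)
    for c in (-2.0, 0.0, 5.0)
    for x in (-1.0, 0.0, 4.0)
)
```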

Theorem 7. If the independent continuum
is a logarithmic interval scale,
then it is impossible for the dependent
continuum to be a ratio scale.

Proof. Let $v(\log x) = u(x)$, i.e.,
$v(y) = u(e^{y})$, then Equation 7 becomes

$$v(\log k + c \log x) = K(k,c)\,v(\log x)$$

Thus, $\log x$ is an interval scale and $v$ is
a ratio scale, which by Theorem 4 is
impossible.

Theorem 8. If the independent continuum
is a logarithmic interval scale
and the dependent continuum is an interval
scale, then $u(x) = \alpha \log x + \beta$,
where $\alpha$ is independent of the unit of the
independent variable.

Proof. Let $v(\log x) = u(x)$, then
Equation 8 becomes

$$v(\log k + c \log x) = K(k,c)\,v(\log x) + C(k,c)$$

so $\log x$ and $v$ are both interval scales.
By Theorem 5,

$$u(x) = v(\log x) = \alpha \log x + \beta$$

Theorem 9. If the independent and
dependent continua are both logarithmic
interval scales, then $u(x) = \alpha x^{\beta}$, where
$\beta$ is independent of the units of both the
independent and dependent variables.

Proof. Take the logarithm of Equation 9
and let $v = \log u$:

$$v(kx^{c}) = K^{*}(k,c) + C(k,c)v(x)$$

where $K^{*}(k,c) = \log K(k,c)$. By
Theorem 8,

$$v(x) = \beta \log x + \alpha^{*}$$

so

$$u(x) = e^{v(x)} = \alpha x^{\beta}$$

where $\alpha = e^{\alpha^{*}}$.

Illustrations 

It  may  be  useful,  prior  to  discussing 
these  results,  to  cite  a  few  familiar 
laws  that  accord  with  some  of  them. 
The  best  source  of  examples  is  classi- 
cal physics,  where  most  of  the  funda- 
mental variables  are  idealized  as  con- 
tinua  that  form  either  ratio  or  interval 
scales.  No  attempt  will  be  made  to 
illustrate  the  results  concerning  loga- 
rithmic interval  scales,  because  no 
actual  use  of  scales  of  this  type  seems 
to  have  been  made. 

The variables entering into Coulomb's
law, Ohm's law, and Newton's
gravitation  law  are  all  ratio  scales,  and 
in  each  case  the  form  of  the  law  is  a 
power  function,  as  called  for  by  Theo- 
rem 1.  Additional  examples  of  Theo- 
rem 1  can  be  found  in  geometry  since 
length,  area,  and  volume  are  ratio 
scales;  thus  the  dependency  of  the 
volume  of  a  sphere  upon  its  radius  or 
of  the  area  of  a  square  on  its  side  are 
illustrations. 

Other  important  variables  such  as 
energy  and  entropy  form  interval 
scales,  and  we  can  therefore  anticipate 
that  as  dependent  variables  they  will 
illustrate  Theorem  2.  If  a  body  of 
constant mass is moving at velocity $v$,
then its energy is of the form $\alpha v^{2} + \delta$.
If the temperature of a perfect gas is
constant, then as a function of pressure
$p$ the entropy of the gas is of the
form $\alpha \log p + \beta$. No examples, of
course, are possible for Theorem 4.

As  an  example  of  Theorem  5  we 
may  consider  ordinary  temperature, 
which  is  frequently  measured  in  terms 
of  the  length  of  a  column  of  mercury. 
Although  length  as  a  measure  forms  a 
ratio  scale,  the  length  of  a  column  of 
mercury  used  to  measure  temperature 
is  an  interval  scale  (subject  to  the 
added  constraint  that  the  length  is 
positive),  since  we  may  choose  any 
initial  length  to  correspond  to  a  given 
temperature,  such  as  the  freezing 
point  of  water.  If  the  temperature 
scale  is  also  an  interval  scale,  as  is 
usually  assumed,  then  the  only  rela- 
tion possible  according  to  Theorem  5 
is  the  linear  one. 

Discussion 

Some  with  whom  I  have  discussed 
these  theorems — which  from  a  mathe- 
matical point  of  view  are  not  new — 
have  had  strong  misgivings  about 
their  interpretation ;  the  feeling  is  that 
something  of  a  substantive  nature 
must  have  been  smuggled  into  the 
formulation  of  the  problem.  They 
argue  that  practically  any  functional 
relation  can  hold  between  two  vari- 
ables and  that  it  is  an  empirical,  not 
a  theoretical,  matter  to  ascertain  what 
the  function  may  be  in  specific  cases. 
To  support  this  view  and  to  challenge 
the  theorems,  they  have  cited  ex- 
amples from  physics,  such  as  the  ex- 
ponential law  of  radioactive  decay  or 
some  sinusoidal  function  of  time,  which 
seem  to  violate  the  theorems  stated 
above.  We  must,  therefore,  examine 
the  ways  in  which  these  examples  by- 
pass the  rather  strong  conclusions  of 
the  present  theory. 

All  physical  examples  which  have 
been  suggested  to  me  as  counter- 
examples to  the  theorems  have  a 
common form: the independent variable
is a ratio scale, but it enters into
the equation in a dimensionless fashion.
For example, some identifiable
value  of  the  variable  is  taken  as  the 
reference level $x_0$, and all other values
are expressed in reference to it as $x/x_0$.
The effect of this is to make the quantity
$x/x_0$ independent of the unit used
to measure the variable, since
$kx/kx_0 = x/x_0$. In periodic functions of
time,  the  period  is  often  used  as  a 
reference  level.  Slightly  more  gen- 
erally, the  independent  variable  only 
appears  multiplied  by  a  constant  c 
whose  units  are  the  inverse  of  those 
of $x$. Thus, whenever the unit of $x$
is changed by multiplying all values
by a constant $k > 0$, it is necessary to
adjust the unit of $c$ by multiplying it
by $1/k$. But this means that the
product is independent of $k$:
$(c/k)(kx) = cx$. The time constant in the law
of radioactive decay is of this nature.

There  are  two  ways  to  view  these 
examples  in  relation  to  the  principle 
stated  above.  If  the  ratio  scale  x  is 
taken  to  be  the  independent  variable, 
then  the  invariance  part  of  the  prin- 
ciple is  not  satisfied  by  these  laws.  If, 
however,  for  the  purpose  of  the  law 
under  consideration  the  dimensionless 
quantity $cx$ is treated as the variable,
then  no  violation  has  occurred.  Al- 
though surprising  at  first  glance,  it  is 
easy  to  see  that  the  principle  imposes 
no  limitations  upon  the  form  of  the 
law  when  the  independent  variable  is 
dimensionless,  i.e.,  when  no  trans- 
formations save  the  identity  are  ad- 
missible. 
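
The role of the dimensional constant can be made concrete with the decay law. In this sketch the half-life of 600 seconds and the elapsed time are made-up numbers; the point is only that the product of the constant and the time is unchanged when the unit of time is changed, provided the constant is rescaled by the inverse factor.

```python
import math

def remaining_fraction(t, decay_constant):
    """Exponential decay law: fraction remaining after time t."""
    return math.exp(-decay_constant * t)

half_life_s = 600.0                      # hypothetical half-life, in seconds
lam_per_s = math.log(2) / half_life_s    # decay constant, in 1/seconds
t_s = 1800.0                             # three half-lives, in seconds

# Change the unit of time from seconds to minutes: t is multiplied by k,
# and the constant, whose unit is the inverse of that of t, by 1/k.
k = 1.0 / 60.0
t_min = k * t_s
lam_per_min = lam_per_s / k

frac_seconds = remaining_fraction(t_s, lam_per_s)
frac_minutes = remaining_fraction(t_min, lam_per_min)

# Rescaling t without rescaling the constant gives a different prediction:
# with x alone as the variable, the exponential form is not invariant.
frac_unadjusted = remaining_fraction(t_min, lam_per_s)
```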

We  are  thus  led  to  the  following  con- 
clusion. Either  the  independent  vari- 
able is  a  ratio  scale  that  is  multiplied 
by  a  dimensional  constant  that  makes 
the  product  independent  of  the  unit  of 
the  scale,  in  which  case  there  is  no  re- 
striction upon  the  laws  into  which  it 
may  enter,  or  the  independent  vari- 
able is  not  rendered  dimensionless,  in 
which  case  the  laws  must  be  of  the 


form  described  by  the  above  theorems. 
Both  situations  are  found  in  classical 
physics,  and  one  wonders  if  there  is 
any fundamental difference between
them.  I  have  not  seen  any  discussion 
of  the  matter,  and  I  have  only  the 
most  uncertain  impression  that  there 
is  a  difference.  In  many  physical  situa- 
tions where  a  dimensional  constant 
multiplies  the  independent  variable, 
the  dependent  variable  is  bounded. 
This  is  true  of  both  the  decay  and 
periodic  laws.  Usually,  the  constant 
is  expressed  in  some  natural  way  in 
terms  of  the  bounds,  as,  for  example, 
the  period  of  a  periodic  function. 
Whether  dimensional  constants  can 
legitimately  be  used  in  other  situa- 
tions, or  whether  they  can  always  be 
eliminated,  is  not  at  all  apparent  to 
me. 

One  may  legitimately  question  which 
of  these  alternatives  is  applicable  to 
psychophysics,  and  the  answer  is  far 
from  clear.  The  widespread  use  of, 
say,  the  threshold  as  a  reference  level 
seems  at  first  to  suggest  that  psycho- 
physical laws  are  to  be  expressed  in 
terms  of  dimensionless  quantities; 
however,  the  fact  that  this  is  done 
mainly  to  present  results  in  decibels 
may  mean  no  more  than  that  the 
given  ratio  scale  is  being  transformed 
into  an  interval  scale  in  accordance 
with Theorem 2:

$$y = \alpha \log (x/x_0) = \alpha \log x + \beta$$

where $\beta = -\alpha \log x_0$.

In  addition  to  dimensionless  vari- 
ables as  a  means  of  by-passing  the  re- 
strictions imposed  by  scale  types, 
three  other  possibilities  deserve  dis- 
cussion. 

First,  the  idealization  that  the  scales 
form  mathematical  continua  and  that 
they are related by a continuous function
may not reflect the actual state
of  affairs  in  the  empirical  world.  It 
is  certainly  true  that,  in  detail,  physi- 
cal continua  are  not  mathematical 
continua,  and  there  is  ample  reason 
to  suspect  that  the  same  holds  for 
psychological  variables.  But  the  as- 
sumptions that  stimuli  and  responses 
both  form  continua  are  idealizations 
that  are  difficult  to  give  up;  to  do 
so  would  mean  casting  out  much 
of  psychophysical  theory.  Alterna- 
tively, we  could  drop  the  demand  that 
the  function  relating  them  be  con- 
tinuous, but  it  is  doubtful  if  this 
would  be  of  much  help  by  itself.  The 
discontinuous  solutions  to,  say,  Equa- 
tion 1  are  manifold  and  extremely  wild 
in  their  behavior.  They  are  so  wild 
that  it  is  difficult  to  say  anything  pre- 
cise about  them  at  all  (see  Hamel, 
1905; Jones, 1942a, 1942b), and it is
doubtful  that  such  solutions  represent 
empirical  laws. 

Second,  casual  observation  suggests 
that  it  might  be  appropriate  to  assume 
that  at  least  the  dependent  variable  is 
bounded,  e.g.,  that  there  is  a  psycho- 
logically maximum  loudness.  Al- 
though plausible,  boundedness  cannot 
be  imposed  by  itself  since,  as  is  shown 
in  the  theorems,  all  the  continuous 
solutions  to  the  appropriate  functional 
equations  are  unbounded  if  the  func- 
tions are  increasing,  as  they  must  be 
for  empirical  reasons.  It  seems  clear 
that  boundedness  of  the  dependent 
variable  is  intimately  tied  up  either 
with  introducing  a  reference  level  so 
that  the  independent  variable  is  an 
absolute  scale  or  with  some  discon- 
tinuity in  the  formulation  of  the  prob- 
lem, possibly  in  the  nature  of  the 
variables  or  possibly  in  the  function 
relating  them.  Actually,  one  can  es- 
tablish that  it  must  be  in  the  nature 
of  the  variables.  Suppose,  on  the 
contrary,  that  the  variables  are  ratio 
scales  that  form  numerical  continua 


and  that  they  are  related  by  a  func- 
tion u  that  is  nonnegative,  noncon- 
stant,  and  monotonic  increasing,  but 
not  necessarily  continuous.  We  now 
need  only  show  that  u  cannot  be 
bounded  to  show  that  the  discon- 
tinuity must  exist  in  the  variable. 
Suppose,  therefore,  that  it  is  bounded 
and that the bound is M. By Equation 1,
$u(kx) = K(k)u(x) \le M$, so
$u(x) \le M/K(k)$. For $k > 1$, the
monotonicity of u implies that
$u(x) \le u(kx) = K(k)u(x)$, so choosing
$u(x) > 0$ we see that $K(k) \ge 1$. If for
some $k > 1$, $K(k) > 1$, then K can be
made arbitrarily large since, for any
integer n, $K(k^n) = K(k)^n$, but since
$u(x) \le M/K(k)$, this implies $u \equiv 0$,
contrary to assumption. Thus, for all
$k > 1$, $K(k) = 1$, which by Equation 1
means $u(kx) = u(x)$ for all x and
$k > 1$. This in turn implies u is a
constant,  which  again  is  contrary  to 
assumption.  Thus,  we  have  estab- 
lished our  claim  that  some  discon- 
tinuity must  reside  in  the  nature  of 
the  variables. 

Third,  in  many  situations,  there  are 
two  or  more  independent  variables; 
for  example,  both  intensity  and  fre- 
quency determine  loudness.  Usually 
we  hold  all  but  one  variable  constant 
in  our  empirical  investigations,  but 
the  fact  remains  that  the  others  are 
there  and  that  their  presence  may 
make  some  difference  in  the  total 
range  of  possible  laws.  For  example, 
suppose  there  are  two  independent 
variables,  x  and  y,  both  of  which 
form  ratio  scales  and  that  the  depend- 
ent variable  u  is  also  a  ratio  scale, 
then  the  analogue  of  Equation  1  is 

\[ u(kx, hy) = K(k, h)\,u(x, y), \]

where  k  >  0,  h  >  0,  and  K(k,h)  >  0. 
We  know  by  Theorem  1  that  if  we 
hold  one  variable,  say  y,  fixed  at  some 
value and let h = 1, then the solution
must be of the form

\[ u(x, y) = \alpha(y)\, x^{\beta(y)}. \]

But holding x constant and letting
k = 1, we also know that it must be
of the form

\[ u(x, y) = \delta(x)\, y^{\gamma(x)}. \]

Thus,

\[ \alpha(y)\, x^{\beta(y)} = \delta(x)\, y^{\gamma(x)}. \]


If  we  restrict  ourselves  to  u's  having 
partial  derivatives  of  both  variables, 
this  equation  can  be  shown  (see  Sec- 
tion 2.C.2  of  Luce  [in  press])  to  have 
solutions  only  of  the  form  : 

\[ u(x, y) = \alpha x^{\beta} y^{\gamma + \delta \log x}. \]

Thus,  the  principle  again  severely  re- 
stricts the  possible  laws,  even  when  we 
admit  more  than  one  independent 
variable.* 
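
The two-variable form can be checked numerically. The Python sketch below assumes the law u(x, y) = alpha x^beta y^(gamma + delta log x) with natural logarithms and confirms that it can be read either as a power function of x whose coefficient and exponent depend on y, or as a power function of y whose coefficient and exponent depend on x. The constants and the function names are hypothetical and serve only as an illustration.

```python
import math
import random

# Hypothetical constants, chosen only for illustration.
ALPHA, BETA, GAMMA, DELTA = 1.7, 0.6, 0.4, 0.25

def u(x, y):
    """Two-variable law u = alpha * x**beta * y**(gamma + delta*log x)."""
    return ALPHA * x**BETA * y ** (GAMMA + DELTA * math.log(x))

def as_power_of_x(x, y):
    """Same law read as a power function of x with y held fixed."""
    coeff = ALPHA * y**GAMMA                 # coefficient depending on y only
    expo = BETA + DELTA * math.log(y)        # exponent depending on y only
    return coeff * x**expo

def as_power_of_y(x, y):
    """Same law read as a power function of y with x held fixed."""
    coeff = ALPHA * x**BETA                  # coefficient depending on x only
    expo = GAMMA + DELTA * math.log(x)       # exponent depending on x only
    return coeff * y**expo

random.seed(0)
points = [(random.uniform(0.1, 10.0), random.uniform(0.1, 10.0)) for _ in range(200)]
max_rel_err = max(
    max(abs(as_power_of_x(x, y) - u(x, y)), abs(as_power_of_y(x, y) - u(x, y))) / u(x, y)
    for x, y in points
)
```

The two readings agree because x^(delta log y) and y^(delta log x) are the same quantity, exp(delta log x log y).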

It  must  be  emphasized  that  the 
remark  in  Footnote  3  does  not  apply 
here.  If  a  function  that  depends  upon 
one  independent  variable  is  added  to 
the  other,  e.g., 

\[ u(x, y) = \alpha(y)\,[x + \gamma(y)]^{\beta(y)}, \]

then wholly new solution possibilities
exist (see Section 2.C.3 of Luce [in
press]).

In  sum,  there  appear  to  be  two  ways 
around  the  restrictions  set  forth  in  the 
theorems.  The  first  can  be  viewed 
either  as  a  rejection  of  Part  2  of  the 
principle  or  as  the  creation  of  a  dimen- 
sionless  independent  variable  from  a 
ratio  scale ;  it  involves  the  presence  of 
dimensional  constants  that  cancel  out 

*  The  use  of  this  argument  to  arrive  at 
the  form  of  u{x,y)  seems  much  more  satis- 
factory and  convincing  than  the  heuristic 
development  given  in  Section  2.C  of  Luce  (in 
press),  and  the  empirical  suggestions  given 
there  should  gain  correspondingly  in  interest 
as  a  result  of  the  present  work. 


the  dimensions  of  the  independent 
variables.  This  appears  to  be  par- 
ticularly appropriate  if  the  dependent 
variable  has  a  true,  well-defined  bound. 
The  second  is  to  reject  the  idealiza- 
tion of  the  variables  as  numerical  con- 
tinua  and,  possibly,  to  assume  that 
they  are  bounded. 

On  the  other  hand,  if  the  theorems 
are  applicable,  then  the  possible  psy- 
chophysical (and  other)  laws  become 
severely  limited.  Indeed,  they  are  so 
limited  that  one  can  argue  that  the 
important  question  is  not  to  deter- 
mine the  forms  of  the  laws,  but  rather 
to  create  empirically  testable  measure- 
ment theories  for  the  several  psycho- 
physical methods  in  order  that  we  may 
know  for  certain  what  types  of  scales 
are  being  obtained.  Once  this  is 
known,  the  form  of  the  psychophysical 
functions is determined except for
some  numerical  constants.  In  the 
meantime,  however,  experimental  de- 
terminations of  the  form  of  the  psy- 
chophysical functions  by  methods  for 
which  no  measurement  theories  exist 
provide at least indirect evidence of
the  type  of  scale  being  obtained.  For 
example,  the  magnitude  methods  seem 
to  result  in  power  functions,  which 
suggests  that  the  psychological  meas- 
ure is  either  a  ratio  or  logarithmic  in- 
terval scale,  not  an  interval  scale. 
Since  the  results  from  cross-modality 
matchings  tend  to  eliminate  the  loga- 
rithmic interval  scale  as  a  possibility, 
there  is  presumptive  evidence  that 
these  methods  yield  ratio  scales,  as 
Stevens  has  claimed. 

TABLE 2

The Possible Laws Satisfying the Principle of Theory Construction

              Scale Types
Independent Variable   Dependent Variable   Possible Laws                          Comments^a
--------------------   ------------------   ------------------------------------   --------------------------------------------
ratio                  ratio                $u(x) = \alpha x^{\beta}$              $\beta/x$; $\beta/u$
ratio                  interval             $u(x) = \alpha \log x + \beta$         $\alpha/x$
                                            $u(x) = \alpha x^{\beta} + \delta$     $\beta/x$; $\beta/u$; $\delta/x$
ratio                  log interval         $u(x) = \delta e^{\alpha x^{\beta}}$   $\alpha/u$; $\beta/x$; $\beta/u$; $\delta/x$
                                            $u(x) = \alpha x^{\beta}$              $\beta/x$; $\beta/u$
interval               ratio                impossible
interval               interval             $u(x) = \alpha x + \beta$              $\beta/x$
interval               log interval         $u(x) = \alpha e^{\beta x}$            $\alpha/x$; $\beta/u$
log interval           ratio                impossible
log interval           interval             $u(x) = \alpha \log x + \beta$         $\alpha/x$
log interval           log interval         $u(x) = \alpha x^{\beta}$              $\beta/x$; $\beta/u$

^a The notation $\alpha/x$ means "$\alpha$ is independent of the unit of x."

Summary

The following problem was considered.
What are the possible forms
of a substantive theory that relates a
dependent variable in a continuous
manner to an independent variable?
Each variable is idealized as a numerical
continuum and is restricted by
its  measurement  theory  to  being  either 
a  ratio,  an  interval,  or  a  logarithmic 
interval  scale.  As  a  principle  of  the- 
ory construction,  it  is  suggested  that 
transformations  of  the  independent 
variable  that  are  admissible  under  its 
measurement  theory  shall  not  result 
in  inadmissible  transformations  of  the 
dependent  variable  (consistency)  and 
that  the  form  of  the  functional  rela- 
tion between  the  two  variables  shall 
not  be  altered  by  admissible  trans- 
formation of  the  independent  variable 
(invariance).  This  principle  limits  sig- 
nificantly the  possible  laws  relating 
the  two  continua,  as  shown  in  Table  2. 
These  results  do  not  hold  in  two  im- 
portant circumstances.  First,  if  the 
independent  variable  is  a  ratio  scale 
that  is  rendered  dimensionless  by 
multiplying  it  by  a  constant  having 
units  reciprocal  to  those  of  the  inde- 
pendent variable,  then  either  the  prin- 
ciple has  no  content  or  it  is  violated, 
depending  upon  how  one  wishes  to 
look  at  the  matter.  Second,  if  the 
variables  are  discrete  rather  than  con- 
tinuous, or  if  the  functional  relation  is 
discontinuous,  then  laws  other  than 
those  given  in  Table  2  are  possible. 
MULTIVARIATE INFORMATION TRANSMISSION*†
William  J.  McGill 

MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 

A  multivariate  analysis  based  on  transmitted  information  is  presented. 
It  is  shown  that  sample  transmitted  information  provides  a  simple  method 
for  measuring  and  testing  association  in  multi-dimensional  contingency 
tables.  Relations  with  analysis  of  variance  are  pointed  out,  and  statistical  tests 
are  described. 

Several  recent  articles  in  the  psychological  journals  have  shown  how 
ideas  derived  from  communication  theory  are  being  applied  in  psychology. 

It  is  not  widely  understood,  however,  that  the  tools  made  available  by 
communication  theory  are  useful  for  analyzing  data  whether  or  not  we 
believe  the  human  organism  is  best  described  as  a  communications  system. 

This  paper  will  present  an  extension  of  Shannon's  (10)  measure  of  trans- 
mitted information.  It  will  be  shown  that  transmitted  information  leads 
to  a  simple  multivariate  analysis  of  contingency  data,  and  to  appropriate 
statistical  tests. 

1.  Basic  Definitions 

Let  us  consider  a  communication  channel  and  its  input  and  output. 
Transmitted  information  measures  the  amount  of  association  between  the 
input  and  output  of  the  channel.  If  input  and  output  are  perfectly  correlated, 
all  the  input  information  is  transmitted.  On  the  other  hand,  if  input  and 
output  are  independent,  no  information  is  transmitted.  Naturally  most 
cases  of  information  transmission  are  found  between  these  extremes.  There  is 
some  uncertainty  at  the  receiver  about  what  was  sent.  Some  information  is 
transmitted  and  some  does  not  get  through. 

We  are  interested  not  in  what  the  transmitted  information  is,  but  in 
the amount of information transmitted. Suppose that we have a discrete
input variable, x, and a discrete output variable, y. Since x is discrete, it
takes on values or signals k = 1, 2, 3, ..., X with probabilities indicated
by p(k). Similarly, y assumes values m = 1, 2, 3, ..., Y with probabilities
p(m). If it happens that k is sent and m is received, we can speak of the
joint input-output event (k,m). This joint event has probability p(k,m).

*This  work  was  supported  in  part  by  the  Air  Force  Human  Factors  Operations 
Research  Laboratories,  and  in  part  jointly  by  the  Army,  Navy,  and  Air  Force  under 
contract  with  the  Massachusetts  Institute  of  Technology. 

†Several of the indices and tests discussed in this paper have been developed
independently by J. E. Keith Smith (11) at the University of Michigan, and by W. R. Garner
at  Johns  Hopkins  University. 

This  article  appeared  in  Psychometrika,  1954,  19,  97-116.    Reprinted  with  permission. 
The rules governing the selection of signals at either end of the channel must
be constructed so that

\[ \sum_{k=1}^{X} p(k) = \sum_{m=1}^{Y} p(m) = \sum_{k,m} p(k,m) = 1. \]

Under  these  conditions,  assuming  successive  signals  are  independent,  the 
amount  of  information  transmitted  in  "bits"  per  signal  is  defined  as 

\[ T(x;y) = H(x) + H(y) - H(x,y), \tag{1} \]

where

\[ H(x) = -\sum_{k} p(k) \log_2 p(k), \]
\[ H(y) = -\sum_{m} p(m) \log_2 p(m), \]
\[ H(x,y) = -\sum_{k,m} p(k,m) \log_2 p(k,m). \]

One "bit" is equal to $-\log_2 (1/2)$ and represents the information conveyed by
a  choice  between  two  equally  probable  alternatives.  Our  development  will  use 
the  bit  as  a  unit,  since  this  is  the  convention  in  information  theory,  but 
any  convenient  unit  may  be  substituted  by  changing  the  base  of  the  logarithm. 

If there is a relation between x and y, H(x) + H(y) > H(x,y) and
the size of the inequality is just T(x;y). On the other hand, if x and y are
independent, H(x,y) = H(x) + H(y) and T(x;y) is zero. It can be shown
that T(x;y) is never negative.
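
The definition in equation (1) can be sketched in a few lines of Python. The function names `entropy` and `transmitted_information` and the three small joint probability tables are hypothetical and are introduced only to illustrate the two extreme cases described above and one intermediate case.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def transmitted_information(joint):
    """T(x;y) = H(x) + H(y) - H(x,y) for a table with joint[k][m] = p(k,m)."""
    p_x = [sum(row) for row in joint]            # marginal input probabilities
    p_y = [sum(col) for col in zip(*joint)]      # marginal output probabilities
    p_xy = [p for row in joint for p in row]     # all joint probabilities
    return entropy(p_x) + entropy(p_y) - entropy(p_xy)

# Hypothetical two-signal channels.
perfect = [[0.5, 0.0], [0.0, 0.5]]         # output determined by input
independent = [[0.25, 0.25], [0.25, 0.25]] # output unrelated to input
noisy = [[0.4, 0.1], [0.1, 0.4]]           # an intermediate case

t_perfect = transmitted_information(perfect)
t_independent = transmitted_information(independent)
t_noisy = transmitted_information(noisy)
```

With these tables the perfectly correlated channel transmits the whole bit of input information, the independent channel transmits none, and the noisy channel falls between the two.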

The  presentation  to  this  point  has  been  an  outline  of  the  properties  of 
the  measure  of  transmitted  information  as  set  forth  by  Shannon  (10).  These 
properties  may  be  summarized  by  stating  that  the  amount  of  information 
transmitted  is  a  bivariate,  positive  quantity  that  measures  the  association 
between  input  and  output  of  a  channel.  There  are,  however,  very  few  restric- 
tions on  how  a  channel  may  be  defined.  The  input-output  relations  that 
occur  in  many  psychological  contexts  are  certainly  possible  channels.  Con- 
sequently we  can  measure  transmitted  information  in  these  contexts  and 
anticipate  that  the  results  will  be  interesting. 

2.  Sample  Information 

Our  development  will  be  based  on  sample  measures  of  information,  i.e., 
on  measures  of  information  constructed  from  relative  frequencies. 

Suppose that we make n observations of events (k,m). We identify
$n_{km}$ as the number of times that k was sent and m was received. This means
that

\[ n_k = \sum_{m} n_{km}, \]
\[ n_m = \sum_{k} n_{km}, \]
\[ n = \sum_{k,m} n_{km}, \]

where $n_k$ is the number of times that k was sent, $n_m$ is the number of times
that m was received, and n is the total number of observations. A particular
experiment can then be represented by a contingency table with XY cells
and entries $n_{km}$.

We may estimate the probabilities p(k), p(m), and p(k,m) with $n_k/n$,
$n_m/n$, and $n_{km}/n$, respectively. Sample transmitted information, T'(x;y), is
defined as

\[ T'(x;y) = H'(x) + H'(y) - H'(x,y), \tag{2} \]

where H'(x), H'(y) and H'(x,y) are constructed from relative frequencies
instead of from probabilities. [Throughout the paper a prime is used over a
quantity to indicate the maximum likelihood estimator of the same quantity
without the prime, e.g., T'(x;y) is an estimator for T(x;y).] As before, T'(x;y)
is the amount of transmitted information (in the sample) measured in "bits"
per signal.

Since  it  is  difficult  to  manipulate  logs  of  relative  frequencies,  we  will 
introduce  an  easier  notation: 

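\[ s_{km} = \frac{1}{n} \sum_{k,m} n_{km} \log_2 n_{km}, \]
\[ s_k = \frac{1}{n} \sum_{k} n_k \log_2 n_k, \]
\[ s_m = \frac{1}{n} \sum_{m} n_m \log_2 n_m, \]
\[ s = \log_2 n. \]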

Expressions  involving  sample  measures  of  information  are  easier  to 
handle in this notation. For example, T'(x;y) becomes

\[ T'(x;y) = s - s_k - s_m + s_{km}. \tag{3} \]

Equations (2) and (3) are equivalent expressions for T'(x;y). When
we  write  equations  like  (3),  we  shall  say  that  these  equations  are  written  in 
s-notation.  Thus  (3)  is  (2)  in  s-notation. 
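
The equivalence of (2) and (3) can be checked on any table of counts. In the Python sketch below the 3 x 3 table and the helper names `s_term` and `sample_entropy` are hypothetical; the s-terms are taken to be sums of n log2 n over the relevant counts, divided by the total n.

```python
import math

def s_term(counts, n):
    """(1/n) * sum of c * log2(c) over the nonzero counts."""
    return sum(c * math.log2(c) for c in counts if c > 0) / n

def sample_entropy(counts, n):
    """Entropy in bits built from the relative frequencies c / n."""
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Hypothetical contingency table: rows are inputs k, columns are outputs m.
table = [[12, 3, 1],
         [4, 10, 5],
         [2, 6, 7]]
n = sum(sum(row) for row in table)
n_k = [sum(row) for row in table]            # row totals
n_m = [sum(col) for col in zip(*table)]      # column totals
n_km = [c for row in table for c in row]     # cell entries

# Equation (2): built from sample entropies.
t_eq2 = sample_entropy(n_k, n) + sample_entropy(n_m, n) - sample_entropy(n_km, n)

# Equation (3): the same quantity in s-notation.
t_eq3 = math.log2(n) - s_term(n_k, n) - s_term(n_m, n) + s_term(n_km, n)
```

The two values agree up to floating-point rounding, and the s-notation form needs only a table of n log2 n.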

3.  Three-Dimensional  Transmitted  Information 

Now  let  us  extend  the  definition  of  transmitted  information  to  include 
two  sources,  u  and  v,  that  transmit  to  y.  To  accomplish  this  we  replace  x 
in  equation  (2)  with  u,  v  and  we  find  that 

\[ T'(u,v;y) = H'(u,v) + H'(y) - H'(u,v,y), \tag{4} \]

where x has been subdivided into two classes, u and v. The possible values
of u are i = 1, 2, 3, ..., U, while v assumes values j = 1, 2, 3, ..., V. The
subdivision is arranged so that the range of values of u and v jointly constitute
the possible values of x. This means that the input event, k, can be replaced
by the joint input event (i,j). Consequently we have

\[ n_k = n_{ij}, \]

and the direct substitution of u,v for x in (2) is legitimate.

Our new term, T'(u,v;y), measures the amount of information transmitted
when u and v transmit to y. It is evident, however, that the direction
of transmission is irrelevant, for examination of (4) reveals that

\[ T'(u,v;y) = T'(y;u,v). \]

This means that nothing is gained formally by distinguishing transmitters
from receivers. The amount of information transmitted is a measure of
association between variables. It does not respect the direction in which
the information is travelling. On the other hand, we cannot permute symbols
at will, for

\[ T'(u,y;v) = H'(u,y) + H'(v) - H'(u,v,y), \]

and this is not necessarily equal to T'(u,v;y).

Our aim now is to measure T'(u,v;y) and then to express T'(u,v;y) as
a function of the bivariate transmissions between u and y, and v and y.
Computation of T'(u,v;y) is not difficult. Our observations of the joint
event (i,j,m) organize themselves into a three-dimensional contingency table
with UVY cells and entries $n_{ijm}$. We can compute the quantities in (4) from
this table, or we can write

\[ T'(u,v;y) = s - s_m - s_{ij} + s_{ijm}, \tag{5} \]

where

\[ s_{ijm} = \frac{1}{n} \sum_{i,j,m} n_{ijm} \log_2 n_{ijm}, \]

and the other s-terms are defined by analogy with the s-terms in equation (3).
Now  suppose  we  want  to  study  transmission  between  u  and  y.  We 
may  eliminate  v  in  two  ways.  First  let  us  reduce  the  three-dimensional 
contingency  table  to  two  dimensions  by  summing  over  v.  The  entries  in  the 
reduced  table  are 

\[ n_{im} = \sum_{j} n_{ijm}. \]

We have for the transmitted information between u and y,

\[ T'(u;y) = s - s_i - s_m + s_{im}. \tag{6} \]

The second way to eliminate v is to compute the transmission between u and
y separately for each value of v and then average these together. This
transmitted information will be called $T'_v(u;y)$, where

\[ T'_v(u;y) = \sum_{j} \frac{n_j}{n}\, T'_j(u;y), \tag{7} \]

and $T'_j(u;y)$ is information transmitted between u and y for a single value
of v, namely j. It is readily shown that

\[ T'_v(u;y) = s_j - s_{ij} - s_{jm} + s_{ijm}. \tag{8} \]

We see that $T'_v(u;y)$ is written in the same way as T'(u;y) except that the
subscript j is added to each of the s-terms.
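
The step from (7) to (8) can be verified numerically. The Python sketch below assumes that the average in (7) weights each value j of v by its relative frequency, and compares that average with the s-notation expression in (8). The 2 x 3 x 2 table of counts and the helper names are hypothetical.

```python
import math

def s_term(counts, n):
    """(1/n) * sum of c * log2(c) over the nonzero counts."""
    return sum(c * math.log2(c) for c in counts if c > 0) / n

def transmission(table):
    """Sample transmitted information T' for a two-dimensional table of counts."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    cells = [c for r in table for c in r]
    return math.log2(n) - s_term(rows, n) - s_term(cols, n) + s_term(cells, n)

# Hypothetical counts n_ijm[i][j][m] with U = 2, V = 3, Y = 2.
n_ijm = [
    [[5, 1], [2, 4], [3, 3]],
    [[1, 6], [4, 2], [2, 5]],
]
U, V, Y = 2, 3, 2
n = sum(n_ijm[i][j][m] for i in range(U) for j in range(V) for m in range(Y))

# Equation (7): average of the within-slice transmissions, weighted by n_j / n.
t_average = 0.0
for j in range(V):
    slice_j = [[n_ijm[i][j][m] for m in range(Y)] for i in range(U)]
    n_j = sum(map(sum, slice_j))
    t_average += (n_j / n) * transmission(slice_j)

# Equation (8): the same quantity in s-notation.
s_j = s_term([sum(n_ijm[i][j][m] for i in range(U) for m in range(Y)) for j in range(V)], n)
s_ij = s_term([sum(n_ijm[i][j]) for i in range(U) for j in range(V)], n)
s_jm = s_term([sum(n_ijm[i][j][m] for i in range(U)) for j in range(V) for m in range(Y)], n)
s_ijm = s_term([n_ijm[i][j][m] for i in range(U) for j in range(V) for m in range(Y)], n)
t_s_notation = s_j - s_ij - s_jm + s_ijm
```

The two values coincide up to rounding, which is why the subscript j can simply be appended to each s-term of the bivariate formula.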

There  are  three  different  pairs  of  variables  in  a  three-dimensional  con- 
tingency table.  For  example,  the  two  equations  for  transmission  between 
v and y are written

\[ T'(v;y) = s - s_j - s_m + s_{jm}, \tag{9} \]

\[ T'_u(v;y) = s_i - s_{ij} - s_{im} + s_{ijm}. \tag{10} \]

Finally we may study transmission between u and v, i.e.,

\[ T'(u;v) = s - s_i - s_j + s_{ij}, \tag{11} \]

\[ T'_y(u;v) = s_m - s_{im} - s_{jm} + s_{ijm}. \tag{12} \]

With  these  results  in  mind  let  us  reconsider  the  information  transmitted 
between  u  and  y.  If  v  has  an  effect  on  transmission  between  u  and  y,  then 
$T'_v(u;y) \ne T'(u;y)$. One way to measure the size of the effect is by

\[ A'(uvy) = T'_v(u;y) - T'(u;y), \]

\[ A'(uvy) = -s + s_i + s_j + s_m - s_{ij} - s_{im} - s_{jm} + s_{ijm}. \tag{13} \]

A few more substitutions will show that

\begin{align}
A'(uvy) &= T'_v(u;y) - T'(u;y) \notag \\
        &= T'_u(v;y) - T'(v;y) \tag{14} \\
        &= T'_y(u;v) - T'(u;v). \notag
\end{align}

In view of this symmetry, we may call A'(uvy) the u-v-y interaction information.
We see that A'(uvy) is the gain (or loss) in sample information transmitted
between any two of the variables, due to additional knowledge of the
third variable.
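
The symmetry expressed in (14) can be illustrated with a short Python sketch. The counts below are hypothetical, as are the helper names; `s_term` forms the s-term of whatever margin of the table is requested, and `transmission` applies the pattern of equations (6) and (8) to any pair of variables, with or without a third variable held constant.

```python
import math
from collections import Counter

def s_term(counts, axes):
    """s-term of the margin over `axes`; axes=() gives s = log2 n."""
    n = sum(counts.values())
    margin = Counter()
    for cell, c in counts.items():
        margin[tuple(cell[a] for a in axes)] += c
    return sum(c * math.log2(c) for c in margin.values() if c > 0) / n

def transmission(counts, a, b, given=()):
    """T' between axes a and b, averaged over the axes in `given` if any."""
    g = tuple(given)
    return (s_term(counts, g) - s_term(counts, g + (a,))
            - s_term(counts, g + (b,)) + s_term(counts, g + (a, b)))

# Hypothetical counts keyed by (i, j, m) for the variables (u, v, y).
counts = {
    (0, 0, 0): 9, (0, 0, 1): 2, (0, 1, 0): 3, (0, 1, 1): 7,
    (1, 0, 0): 4, (1, 0, 1): 6, (1, 1, 0): 8, (1, 1, 1): 1,
}
U_AX, V_AX, Y_AX = 0, 1, 2

# The three differences of equation (14).
a_from_uy = transmission(counts, U_AX, Y_AX, given=(V_AX,)) - transmission(counts, U_AX, Y_AX)
a_from_vy = transmission(counts, V_AX, Y_AX, given=(U_AX,)) - transmission(counts, V_AX, Y_AX)
a_from_uv = transmission(counts, U_AX, V_AX, given=(Y_AX,)) - transmission(counts, U_AX, V_AX)
```

All three differences reduce to the same combination of s-terms, so they agree whichever pair of variables is chosen.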

Now  we  can  express  the  three-dimensional  information  transmitted 
from  u,v  to  y,  i.e.,  T'(u,v;y),  as  a  function  of  its  bivariate  components,  for 

\[ T'(u,v;y) = T'(u;y) + T'(v;y) + A'(uvy), \tag{15} \]

\[ T'(u,v;y) = T'_v(u;y) + T'_u(v;y) - A'(uvy). \tag{16} \]
Equations (15) and (16) taken together mean that T'(u,v;y) can be represented
by a diagram with overlapping circles as shown in Figure 1. The diagram
assumes what we shall call "positive" interaction between u,v and y. Interaction
is positive when the effect of holding one of the interacting variables
constant is to increase the amount of association between the other two.
This means that $T'_v(u;y) > T'(u;y)$ and $T'_u(v;y) > T'(v;y)$. [Because of (14),
if one of these inequalities holds, both must hold.] Later on, however, we
shall show that interaction may be negative. When this happens, relations
between the interacting variables are reversed, and the diagram in Figure 1
is no longer strictly correct.

Figure 1
Schematic diagram of the components of three-dimensional transmitted
information. The diagram shows that three-dimensional transmission can be
analyzed into a pair of bivariate transmissions plus an interaction term.
The meanings of the symbols are explained in the text. [Regions labeled in
the diagram: $T'_v(u;y)$, $T'_u(v;y)$, and $T'(u,v;y)$.]
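
A hypothetical extreme case of positive interaction, not taken from the paper, may help to fix ideas. Suppose u and v are binary, every input pair occurs equally often, and the response y is 1 exactly when u and v differ. Neither input alone then tells us anything about y, yet the two together determine it. The Python sketch below computes the relevant quantities in s-notation; the counts and helper names are assumptions made for illustration.

```python
import math
from collections import Counter

def s_term(counts, axes):
    """s-term of the margin over `axes`; axes=() gives s = log2 n."""
    n = sum(counts.values())
    margin = Counter()
    for cell, c in counts.items():
        margin[tuple(cell[a] for a in axes)] += c
    return sum(c * math.log2(c) for c in margin.values() if c > 0) / n

# Hypothetical counts keyed by (i, j, m): y = 1 when u and v differ,
# with ten observations of each of the four input pairs.
counts = {(i, j, i ^ j): 10 for i in (0, 1) for j in (0, 1)}

def S(*axes):
    """Shorthand: S() is s, S(0) is s_i, S(0, 2) is s_im, and so on."""
    return s_term(counts, axes)

t_uy = S() - S(0) - S(2) + S(0, 2)                # equation (6)
t_vy = S() - S(1) - S(2) + S(1, 2)                # equation (9)
t_uy_given_v = S(1) - S(0, 1) - S(1, 2) + S(0, 1, 2)   # equation (8)
t_vy_given_u = S(0) - S(0, 1) - S(0, 2) + S(0, 1, 2)   # equation (10)
interaction = t_uy_given_v - t_uy                 # equation (13)
t_joint = S() - S(2) - S(0, 1) + S(0, 1, 2)       # equation (5)
```

Here both bivariate transmissions are zero, the interaction is one bit, and the joint transmission is one bit, in agreement with (15) and (16) up to floating-point rounding.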

4.  Components  of  Response  Information 

The  multivariate  model  of  information  transmission  is  useful  to  us 
because  the  situations  treated  by  communication  theory  are  not  the  same  as 
those we deal with in psychological applications. The engineer is usually able
to  restrict  himself  to  transmission  from  a  single  information  source.  He 
knows  the  statistical  properties  of  the  source,  and  when  he  speaks  of  noise  he 
means  random  noise.  This  kind  of  precision  is  seldom  available  to  us.  In  our 
experiments  we  generally  do  not  know  in  advance  how  many  sources  are 
transmitting  information.  We  must  therefore  be  careful  not  to  confuse 
statistical  noise  with  the  experimenter's  ignorance. 

The  bivariate  model  of  transmitted  information  provided  by  communi- 
cation theory  tells  us  to  attribute  to  random  noise  whatever  uncertainty  there 
is in specifying the response when the stimulus is known (1). Consequently,
if  several  sources  transmit  information  to  responses,  the  bivariate  model 
will  certainly  fail  to  discriminate  effects  due  to  uncontrolled  sources  from 
those  due  to  random  variability.  On  the  other  hand,  the  multivariate  model 
can  measure  the  effects  due  to  the  various  transmitting  sources.  For  example, 
in  three-dimensional  transmission  we  find  that 

\[ H'(y) = H'_{uv}(y) + T'(u;y) + T'(v;y) + A'(uvy), \tag{17} \]

where $H'(y) = s - s_m$ and $H'_{uv}(y) = s_{ij} - s_{ijm}$.

We  see  that  H'(y),  the  response  information,  has  been  analyzed  into 
an  error  term  plus  a  set  of  correlation  terms  due  to  the  input  variables.  The 
error term, $H'_{uv}(y)$, is the residual or unexplained variability in the output,
y, after the information due to the inputs, u and v, has been removed. In
bivariate information transmission, the response information is analyzed less
precisely. For example, we may have

\[ H'(y) = H'_u(y) + T'(u;y). \tag{18} \]

In this case the error term is $H'_u(y)$ because only one input, u, is recorded.
Shannon (10) showed that

\[ H'_u(y) \ge H'_{uv}(y). \]

In other words the error term, when only u is controlled, cannot be increased
if we also control v. In fact

\[ H'_u(y) = H'_{uv}(y) + T'_u(v;y). \tag{19} \]

Equation  (19)  is  proved  by  expanding  both  sides  in  s-notation.  Thus 
if  u  and  v  are  stimulus  variables  that  transmit  information  via  responses,  y, 
we have an error term, $H'_u(y)$, provided we keep track of only one of the
inputs,  namely,  u.  However,  this  error  term  contains  a  still  smaller  error 
term  as  well  as  the  information  transmitted  from  v.  Controlling  v  is  thus 
seen  to  be  equivalent  to  extracting  the  association  between  v  and  y  from  the 
noise.  Multivariate  transmitted  information  is  essentially  information 
analyzed  from  the  noise  part  of  bivariate  transmission. 
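
Equations (17) and (19) can be confirmed by expanding both sides in s-notation, and the same check can be made numerically. The Python sketch below does so for a hypothetical three-dimensional table, taking the error term with only u recorded to be $s_i - s_{im}$ by analogy with the expression given for $H'_{uv}(y)$. All counts and helper names are assumptions made for illustration.

```python
import math
from collections import Counter

def s_term(counts, axes):
    """s-term of the margin over `axes`; axes=() gives s = log2 n."""
    n = sum(counts.values())
    margin = Counter()
    for cell, c in counts.items():
        margin[tuple(cell[a] for a in axes)] += c
    return sum(c * math.log2(c) for c in margin.values() if c > 0) / n

# Hypothetical counts keyed by (i, j, m) for the variables (u, v, y).
counts = {
    (0, 0, 0): 7, (0, 0, 1): 3, (0, 1, 0): 2, (0, 1, 1): 8,
    (1, 0, 0): 5, (1, 0, 1): 5, (1, 1, 0): 9, (1, 1, 1): 1,
}
U, V, Y = 0, 1, 2

def S(*axes):
    """Shorthand: S() is s, S(U) is s_i, S(U, Y) is s_im, and so on."""
    return s_term(counts, axes)

h_y = S() - S(Y)                         # response information H'(y)
h_y_given_u = S(U) - S(U, Y)             # error term when only u is recorded
h_y_given_uv = S(U, V) - S(U, V, Y)      # error term when u and v are recorded
t_uy = S() - S(U) - S(Y) + S(U, Y)
t_vy = S() - S(V) - S(Y) + S(V, Y)
t_vy_given_u = S(U) - S(U, V) - S(U, Y) + S(U, V, Y)
a_uvy = (-S() + S(U) + S(V) + S(Y)
         - S(U, V) - S(U, Y) - S(V, Y) + S(U, V, Y))

rhs_17 = h_y_given_uv + t_uy + t_vy + a_uvy
rhs_19 = h_y_given_uv + t_vy_given_u
```

Both identities hold exactly apart from rounding, and the error term never grows when the second input is brought under control.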

5.  An  Example 

The  kind  of  analysis  that  multivariate  information  transmission  yields 
can  be  illustrated  by  a  set  of  data  obtained  from  one  subject  in  an  experiment 
on  frequency  judgment. 

Four  equally  loud  tones,  890,  925,  970,  and  1005  cycles  per  second 
were  presented  to  the  subject  one  at  a  time  in  random  order.  Each  tone  was 
1 second long and separated by about 3 seconds from the next tone. During
preliminary  training  the  subject  learned  to  identify  the  tones  by  pairing  them 
with  four  response  keys.  In  experimental  sessions,  a  loud  masking  noise  was 
turned  on  and  a  random  sequence  of  250  tones  was  presented  against  the 
noise background. A flashing light told the subject when the stimulus occurred,
and  he  was  instructed  to  guess  if  in  doubt  about  which  one  of  the  four  tones 
it  was. 

One  object  of  the  experiment  was  to  find  weights  for  both  the  frequency 
stimulus  and  the  immediately  preceding  response  in  determining  which  key 
the  subject  would  press.  Tests  were  run  at  several  signal-to-noise  ratios. 
The  data  presented  here  were  obtained  when  the  signal-to-noise  ratio  was 
close  to  the  masked  threshold. 

In  order  to  calculate  weights,  we  can  consider  the  experiment  as  an 
example  of  three-dimensional  transmission.  Our  analysis  is  based  on  the 
responses  to  the  125  even-numbered  stimuli.  The  odd-numbered  responses  are 
considered  as  the  context  in  which  the  subject  judged  the  even-numbered 
stimuli.  The  odd-numbered  stimuli  are  ignored  in  this  analysis. 

The  stimuli  will  be  designated  as  the  variable  u.  Last  previous  responses 
are  called  "presponses"  and  they  will  be  indicated  by  the  variable  v.  These 
are  the  inputs.  Current  responses  are  represented  by  y.  This  is  the  output 
variable.  Thus  we  can  identify  the  joint  event  (i,j,m)  as  the  occurrence  of 
response  m  to  stimulus  i,  following  presponse  j.  Failure  to  respond  is  con- 
sidered as  a  possible  response.  Consequently  there  are  four  stimulus  cate- 
gories and  five  response  categories. 

The subject's responses to the 125 test stimuli were sorted into a
4 × 5 × 5 contingency table. Two of the reduced tables that were obtained
from this master table are reproduced here in order to illustrate our
computations.

TABLE 1
Stimulus-Response Frequency Table

TABLE 2
Presponse-Response Frequency Table

For example, the Stimulus-Response plot in Table 1 has entries
$n_{im}$. The calculation for $s_{im}$ goes as follows:

\[ s_{im} = \frac{1}{125} [1 \log_2 1 + 5 \log_2 5 + 12 \log_2 12 + \cdots + 7 \log_2 7 + 10 \log_2 10], \]
\[ s_{im} = 374.05750/125, \]
\[ s_{im} = 2.99246. \]

In the same way, $s_{jm}$ is computed from the figures for $n_{jm}$ in the
Presponse-Response table, Table 2:

\[ s_{jm} = \frac{1}{125} [1 \log_2 1 + 1 \log_2 1 + 2 \log_2 2 + \cdots + 9 \log_2 9 + 3 \log_2 3], \]
\[ s_{jm} = 372.38710/125, \]
\[ s_{jm} = 2.97910. \]
We obtain the value for $s_i$ from the $n_i$ in the bottom marginal of Table 1:

\[ s_i = \frac{1}{125} [31 \log_2 31 + 30 \log_2 30 + 33 \log_2 33 + 31 \log_2 31], \]
\[ s_i = 620.83188/125, \]
\[ s_i = 4.96665. \]

The computation for s is based on the total number of measurements:

\[ s = \log_2 125 = 6.96579. \]
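
These two values can be recomputed directly from the figures quoted above. The Python sketch below uses only the four stimulus marginals 31, 30, 33, and 31 given in the text; the variable names are hypothetical.

```python
import math

# Stimulus marginals quoted in the text (bottom marginal of Table 1).
stimulus_marginals = [31, 30, 33, 31]
n = sum(stimulus_marginals)

# Sum of n_i * log2(n_i), then divide by n to obtain s_i.
n_log_n_total = sum(c * math.log2(c) for c in stimulus_marginals)
s_i = n_log_n_total / n

# s depends only on the total number of measurements.
s = math.log2(n)
```

The marginals sum to 125, and the recomputed values of the bracketed sum, of $s_i$, and of s agree with the printed figures to within rounding in the last decimal place.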

It is evident that these calculations are performed very easily with
a table of $n \log_2 n$. If he wishes, the reader may also make the computations
with tables of $p \log_2 p$ like those prepared by Newman (8), and Dolansky (3).
The use of $p \log_2 p$ tables for analyzing discrete data is not recommended,
however, because it leads to rounding errors that the table of $n \log_2 n$ avoids.
The  complete  set  of  s-terms  in  the  experiment  on  frequency  judgment  worked 
out  as  follows: 

$s_{ijm}$ = 1.45211        $s_i$ = 4.96665

$s_{ij}$  = 2.91389        $s_j$ = 4.79269

$s_{im}$  = 2.99246        $s_m$ = 4.93380

$s_{jm}$  = 2.97910        $s$   = 6.96579

In  section  4  it  was  shown  that  response  information,  H'(y),  can  be 
analyzed  into  components 

\[ H'(y) = H'_{uv}(y) + T'(u;y) + T'(v;y) + A'(uvy). \tag{17} \]
Since $H'(y) = s - s_m$, we see that H'(y) = 2.03199 bits. If the subject
had used the four response keys equally often, this figure would have been
at most 2 bits. The extra information shows that the subject sometimes did
not respond. This can be verified from the right-hand marginals in Tables 1
and 2. The rest of the quantities in equation (17) are easily computed from
s-terms. For example, $H'_{uv}(y)$ is computed from $s_{ij} - s_{ijm}$. We see that
$H'_{uv}(y)$ is 1.46178 bits. This is the part of the response information that
is  not  accounted  for  either  by  the  auditory  stimuli  or  the  presponses.  Con- 
sequently, 1.46178/2.03199  or  72  per  cent  of  the  response  information  is 
unanalyzed  error.  Some  28  per  cent  of  the  response  information  must  therefore 
be  due  to  associations  between  the  subject's  responses  and  the  two  predicting 
variables. 

If we consider the association between auditory stimuli (u) and responses
(y), we have

\[ T'(u;y) = s - s_i - s_m + s_{im}, \]

\[ T'(u;y) = .05780. \]

Thus  only  .058  bits  are  transmitted  from  the  frequency  stimuli,  accounting  for 
less  than  3  per  cent  of  the  response  information.  This  is  not  surprising  because 
the  signal-to-noise  ratio  was  set  near  the  masked  threshold  and  the  stimuli 
were  difficult  to  hear. 

If  we  consider  the  association   between  presponses   (v)   and   current 
responses  {y),  we  find  a  little  more  transmitted  information: 

\[ T'(v;y) = s - s_j - s_m + s_{jm}, \]

\[ T'(v;y) = .21840. \]

This value of .218 bits transmitted amounts to some 11 per cent of the
response  information. 

The  last  element  in  equation  (17)  is  the  stimulus  X  response  X  presponse 
interaction,  A'{uvy).  This  is  computed  from 

\[ A'(uvy) = -s + s_i + s_j + s_m - s_{ij} - s_{im} - s_{jm} + s_{ijm}, \]

\[ A'(uvy) = .29401. \]

We  see  that  about  14  per  cent  of  the  response  information  is  due  to  the 
interaction.  Knowledge  of  the  interaction  also  permits  us  to  hold  one  of  the 
inputs  constant  while  measuring  transmission  from  the  other  input.  For 
example,  the  transmission  from  stimuli  to  responses  with  presponses  held 
constant  is: 

$$\begin{aligned}
T'_v(u;y) &= s_j - s_{ij} - s_{jk} + s_{ijk} \\
&= T'(u;y) + A'(uvy) \\
&= .35181.
\end{aligned}$$

Our calculations for the parts of the response information that we
can analyze with the three-dimensional model lead to weights of approximately
3, 11 and 14 per cent for stimuli, presponses and interaction respectively.
These figures sum to 28 per cent, the amount of transmitted information
we predicted from the size of the noise term. We can also obtain this total
weight directly by computing the information transmitted from both inputs
together. We have

$$T'(u,v;y) = s - s_k - s_{ij} + s_{ijk},$$

$$T'(u,v;y) = .57021.$$

If  we  now  divide  this  three-dimensional  transmitted  information  by  the 
response  information,  we  get  back  our  figure  of  28  per  cent. 
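
The bookkeeping in this example can be checked with a few lines of arithmetic. The sketch below uses only the values reported above; the variable names are illustrative and are not part of the original analysis.

```python
# Values reported in the text, in bits.
H_y = 2.03199        # response information H'(y)
H_uv_y = 1.46178     # noise term H'_uv(y)
T_uy = 0.05780       # stimuli to responses, T'(u;y)
T_vy = 0.21840       # presponses to responses, T'(v;y)
A_uvy = 0.29401      # interaction, A'(uvy)

# Transmission from both inputs together is the sum of the three components.
T_uv_y = T_uy + T_vy + A_uvy

# The components plus the noise term should reproduce H'(y).
residual = H_y - (H_uv_y + T_uv_y)

# Share of the response information carried by each part.
shares = {
    "noise": H_uv_y / H_y,
    "stimuli": T_uy / H_y,
    "presponses": T_vy / H_y,
    "interaction": A_uvy / H_y,
    "both inputs": T_uv_y / H_y,
}

if __name__ == "__main__":
    print(f"T'(u,v;y) = {T_uv_y:.5f} bits, residual = {residual:.1e}")
    for name, share in shares.items():
        print(f"{name:12s} {100 * share:5.1f} per cent")
```

The residual is zero to rounding error, which is the additivity property discussed in the next paragraph.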

There are several points worth noting about our application of information
theory to this experiment. The first is that the analysis is additive.
The  component  measures  of  association  plus  the  measure  of  error  (or  noise) 
sum  to  the  response  information.  Furthermore,  the  analysis  is  exact.  No 
approximations  are  involved.  The  process  is  very  similar  to  the  partition 
of  a  sum  of  squares  in  analysis  of  variance.  As  a  matter  of  fact,  a  notation 
can  be  worked  out  in  analysis  of  variance  that  is  exactly  parallel  to  the 
s-notation  in  multivariate  information  transmission  (4). 
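
The additive partition can be sketched for an arbitrary three-way frequency table. The code below assumes the s-terms are defined as $s = \log_2 n$ and $s_i = (1/n)\sum_i n_i \log_2 n_i$ (and likewise for the other margins), which is consistent with $H'(y) = s - s_k$ as used above. The function names and the 2 × 2 × 2 table are hypothetical and are not data from the experiment.

```python
from collections import defaultdict
from math import log2


def s_term(counts, axes):
    """s-term for the margin over `axes`: (1/n) * sum(n_cell * log2(n_cell)).

    `counts` maps (i, j, k) to a frequency. With axes == () the margin is
    the grand total n, so the result is s = log2(n).
    """
    n = sum(counts.values())
    margin = defaultdict(int)
    for cell, freq in counts.items():
        margin[tuple(cell[a] for a in axes)] += freq
    return sum(m * log2(m) for m in margin.values() if m > 0) / n


def decompose(counts):
    """Return H'(y), the noise term, and the three components, in bits."""
    U, V, Y = 0, 1, 2                      # axis positions of u, v, y
    s = s_term(counts, ())
    s_i, s_j, s_k = (s_term(counts, (a,)) for a in (U, V, Y))
    s_ij, s_ik, s_jk = (s_term(counts, p) for p in ((U, V), (U, Y), (V, Y)))
    s_ijk = s_term(counts, (U, V, Y))
    return {
        "H_y": s - s_k,
        "noise": s_ij - s_ijk,
        "T_uy": s - s_i - s_k + s_ik,
        "T_vy": s - s_j - s_k + s_jk,
        "A_uvy": -s + s_i + s_j + s_k - s_ij - s_ik - s_jk + s_ijk,
    }


# Hypothetical 2 x 2 x 2 table of frequencies n_ijk (not data from the paper).
table = {
    (0, 0, 0): 10, (0, 0, 1): 2, (0, 1, 0): 4, (0, 1, 1): 8,
    (1, 0, 0): 3,  (1, 0, 1): 9, (1, 1, 0): 1, (1, 1, 1): 13,
}
parts = decompose(table)
total = parts["noise"] + parts["T_uy"] + parts["T_vy"] + parts["A_uvy"]
```

Here `total` agrees with `parts["H_y"]` up to floating-point error, with no approximation involved, because the s-terms cancel term by term.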

The  second  point  is  that  information  transmission  is  made  to  order 
for  contingency  tables.  Measures  of  transmitted  information  are  zero  when 
variables are independent in the contingency-sense (as opposed to the
restriction to linear independence in analysis of variance). In addition, the analysis
is  designed  for  frequency  data  in  discrete  categories,  while  methods  based  on 
analysis  of  variance  are  not.  No  assumptions  about  linearity  are  introduced 
in  multivariate  information  transmission.  Furthermore,  when  statistical 
tests  are  developed  in  a  later  section,  it  will  be  shown  that  these  tests  are 
distribution-free  in  the  sense  that  they  are  extensions  of  the  familiar  chi- 
square  test  of  independence. 

The  measure  of  amount  of  information  transmitted  also  has  certain 
inherent  advantages.  Garner  and  Hake  (2)  and  Miller  (5)  have  pointed  out 
that  the  amount  of  information  transmitted  is  approximately  the  logarithm 
of  the  number  of  perfectly  discriminated  input-classes.  In  experiments  on 
discrimination like the one we have discussed, the measure provides an
immediate picture of the subject's discriminative ability. Miller has also
discussed applications of this property in mental testing and in the general
theory  of  measurement. 

6.  Independence  in  Three-Dimensional  Transmission 

It is evident from the definition of transmitted information that
$T'(u,v;y) = 0$ when the output is independent of the joint input, i.e., when

$$n_{ijk} = \frac{n_{ij}\,n_k}{n}. \tag{20}$$

With this kind of independence, we can show that

$$s_{ijk} = s_{ij} + s_k - s.$$

This expression for $s_{ijk}$ may be substituted into (5) to confirm the fact that
$T'(u,v;y) = 0$.

Now suppose that $T'(u,v;y) > 0$ but that v and y are independent,
that is to say,

$$n_{jk} = \frac{n_j\,n_k}{n}. \tag{21}$$

This leads to

$$s_{jk} = s_j + s_k - s.$$

If we substitute for $s_{jk}$ in equation (9), we find that $T'(v;y) = 0$. Equation
(21)  does  not  provide  a  unique  condition  for  independence  between  v  and  y. 
To  show  this,  let  us  pick  some  value  of  u  and  study  the  v-to-y  transmission 
at  that  value  of  u.  We  now  require  that 

$$n_{ijk} = \frac{n_{ij}\,n_{ik}}{n_i}. \tag{22}$$

If we have (22) for all i, we must have

$$s_{ijk} = s_{ij} + s_{ik} - s_i,$$

and it follows from substitution in (10) that $T'_u(v;y) = 0$. This is the situation
in which v and y are independent provided that u is held constant. It is an
interesting case because we can show from (14) that if this kind of independence
happens,

$$A'(uvy) = -T'(v;y).$$

The sign of $T'(v;y)$ must be positive or zero so that $-T'(v;y)$ must be negative
or zero. Consequently, $A'(uvy)$ can be negative. We see that negative interaction
information is produced when the information transmitted between a
pair of variables is due to a regression on a third variable. Holding the
interacting variable constant causes the transmitted information to disappear.
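
A small numerical case may make the negative interaction concrete. The sketch below builds a hypothetical table in which $n_{ijk} = n_{ij}\,n_{ik}/n_i$ holds for every i, so that v and y are independent once u is held constant although both follow u. It works from sample entropies rather than s-terms, and the helper names are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(counts, axes):
    """Sample entropy in bits of the margin of `counts` over `axes`."""
    n = sum(counts.values())
    margin = Counter()
    for cell, freq in counts.items():
        margin[tuple(cell[a] for a in axes)] += freq
    return -sum(m / n * log2(m / n) for m in margin.values() if m > 0)


def transmitted(counts, a, b):
    """T'(a;b) = H'(a) + H'(b) - H'(a,b)."""
    return entropy(counts, (a,)) + entropy(counts, (b,)) - entropy(counts, (a, b))


def transmitted_given(counts, a, b, c):
    """T'_c(a;b) = H'(a,c) + H'(b,c) - H'(c) - H'(a,b,c)."""
    return (entropy(counts, (a, c)) + entropy(counts, (b, c))
            - entropy(counts, (c,)) - entropy(counts, (a, b, c)))


U, V, Y = 0, 1, 2

# Hypothetical frequencies: for each u, n_ijk = n_ij * n_ik / n_i, so v and y
# are independent when u is held constant, yet both tend to follow u.
row = {0: (3, 1), 1: (1, 3)}           # relative counts of v (and of y) given u
table = {(i, j, k): row[i][j] * row[i][k]
         for i in (0, 1) for j in (0, 1) for k in (0, 1)}

T_vy = transmitted(table, V, Y)                   # positive: v and y both track u
T_vy_given_u = transmitted_given(table, V, Y, U)  # zero: u explains the association
A_uvy = T_vy_given_u - T_vy                       # interaction as a change in transmission
```

In this table `T_vy_given_u` vanishes, so `A_uvy` equals `-T_vy` and is negative, as the argument above requires.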

If we have the independence defined by (21), we may not necessarily
have the independence defined by (22). Let us suppose that we have both, i.e.,
that we have

$$s_{jk} = s_j + s_k - s,$$

$$s_{ijk} = s_{ij} + s_{ik} - s_i.$$

Now we substitute for $s_{jk}$ and $s_{ijk}$ in equation (8).

$$\begin{aligned}
T'_v(u;y) &= s_j - s_{ij} - s_{jk} + s_{ijk}, \\
T'_v(u;y) &= s_j - s_{ij} - s_j - s_k + s + s_{ij} + s_{ik} - s_i, \\
T'_v(u;y) &= s - s_i - s_k + s_{ik}, \\
T'_v(u;y) &= T'(u;y).
\end{aligned}$$

Both  kinds  of  independence,  (21)  and  (22),  together  mean  that  v  is  not 
involved  in  transmission  between  u  and  y.  When  this  happens  we  do  not 
have  three-dimensional  transmission,  since  u  is  the  only  input  variable 
(provided  that  no  information  is  transmitted  between  u  and  v).  As  might 
be  expected,  both  kinds  of  independence  can  be  generated  from  a  single 
restriction  on  the  data,  namely 


$$n_{ijk} = \frac{n_{ik}}{V},$$


where  V  is  the  number  of  classes  in  v. 

We  have  studied  the  case  where  v  is  independent  of  y.  We  could  have 
had  u  independent  of  y,  or  u  independent  of  v.  The  results  are  analogous  to 
those  we  have  presented. 

7.  Correlated  Sources  of  Information 

Three-dimensional transmitted information, $T'(u,v;y)$, accounts for
only  part  of  the  total  amount  of  association  in  a  three-dimensional  contingency 
table.  It  does  not  exhaust  all  the  association  in  the  table  because  it  neglects 
the  association  between  the  inputs.  When  this  association  is  considered,  i.e., 
when  all  the  relations  in  the  contingency  table  are  represented,  we  are  led  to 
an  equation  that  is  very  useful  for  generating  the  components  of  multivariate 
transmission.  Consider 

$$C'(u,v,y) = H'(u) + H'(v) + H'(y) - H'(u,v,y). \tag{23}$$

If we add and subtract $H'(u,v)$, we obtain

$$C'(u,v,y) = T'(u;v) + T'(u,v;y),$$

$$C'(u,v,y) = T'(u;v) + T'(u;y) + T'(v;y) + A'(uvy). \tag{24}$$

We see that $C'(u,v,y)$ generates all possible components of the three
correlated information-sources, u, v, and y.
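
The add-and-subtract step can be written out in full. The grouping below relies only on the definitions $T'(u;v) = H'(u) + H'(v) - H'(u,v)$ and $T'(u,v;y) = H'(u,v) + H'(y) - H'(u,v,y)$, which are taken here from the earlier sections:

```latex
\begin{aligned}
C'(u,v,y) &= H'(u) + H'(v) + H'(y) - H'(u,v,y) \\
          &= \bigl[H'(u) + H'(v) - H'(u,v)\bigr]
           + \bigl[H'(u,v) + H'(y) - H'(u,v,y)\bigr] \\
          &= T'(u;v) + T'(u,v;y).
\end{aligned}
```

With the values of the worked example, $C'(u,v,y) = .69055$ (reported in section 9) and $T'(u,v;y) = .57021$, this gives $T'(u;v) = .12034$ bits for the association between the two inputs.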

8.  Four-Dimensional  Transmitted  Information 

It  will  be  instructive  to  extend  our  measures  one  step  further,  i.e.,  to 
transmitted  information  with  three  input  variables,  since  from  that  point 
results can be generalized easily to an N-dimensional input. For simplicity
we shall restrict our development to the case of a channel with a multivariate
input and a univariate output. The more general case with N inputs and
M  outputs  does  not  present  any  special  problems,  and  can  be  constructed 
with  no  difficulty  once  the  rules  become  clear. 

Let  us  add  a  new  variable  w  to  the  bivariate  input,  u,v.  The  joint  input 
is now u,v,w. We suppose that w sends signals h = 1, 2, 3, ..., W. This gives
us four sources of information u, v, w, and y. We can proceed to define a
four-way interaction information, $A'(uvwy)$, as follows:

$$A'(uvwy) = A'_w(uvy) - A'(uvy).$$

We have already defined $A'(uvy)$. The definition of $A'_w(uvy)$ will be similar
except that the subscript w indicates that $A'(uvy)$ is to be averaged over w.
As we have already noted, this is accomplished by adding the subscript h to
each of the s-terms that make up $A'(uvy)$. Consequently

$$A'_w(uvy) = -s_h + s_{hi} + s_{hj} + s_{hk} - s_{hij} - s_{hik} - s_{hjk} + s_{hijk}. \tag{25}$$

It is readily shown that $A'(uvwy)$ is symmetrical in the sense that it does not
matter which variable is chosen for averaging, i.e.,

$$\begin{aligned}
A'(uvwy) &= A'_u(vwy) - A'(vwy), \\
         &= A'_v(uwy) - A'(uwy), \\
         &= A'_w(uvy) - A'(uvy), \\
         &= A'_y(uvw) - A'(uvw).
\end{aligned} \tag{26}$$

We see that $A'(uvwy)$ is the amount of information gained (or lost) in
transmission by controlling a fourth variable when any three of the variables are
already known.

If we examine all possible associations in a four-dimensional contingency
table, we obtain

$$\begin{aligned}
C'(u,v,w,y) ={}& T'(u;v) + T'(u;w) + T'(u;y) + T'(v;w) + T'(v;y) + T'(w;y) \\
              &+ A'(uvw) + A'(uvy) + A'(uwy) + A'(vwy) + A'(uvwy),
\end{aligned} \tag{27}$$

where

$$C'(u,v,w,y) = H'(u) + H'(v) + H'(w) + H'(y) - H'(u,v,w,y).$$

Equation (27) can be proved by expanding both sides in s-notation.
It turns out that in the general case, $C'(u,v,w,\ldots,y)$ is expanded by writing
down T-terms for all possible pairs of variables, and A-terms for all possible
combinations of three, four variables and so on.

Four-dimensional transmitted information from u,v,w to y, i.e.,
$T'(u,v,w;y)$, can be written as follows:

$$T'(u,v,w;y) = H'(y) + H'(u,v,w) - H'(u,v,w,y). \tag{28}$$

The same arguments are used to justify (28) as were used in the case of (4)
in three-dimensional transmission. To find the components of $T'(u,v,w;y)$,
we note that

$$T'(u,v,w;y) = C'(u,v,w,y) - C'(u,v,w). \tag{29}$$

This means that $T'(u,v,w;y)$ contains all the components of $C'(u,v,w,y)$ except
the correlations among the inputs. Consequently the components of
$T'(u,v,w;y)$ are

$$\begin{aligned}
T'(u,v,w;y) ={}& T'(u;y) + T'(v;y) + T'(w;y) \\
              &+ A'(uvy) + A'(uwy) + A'(vwy) + A'(uvwy).
\end{aligned} \tag{30}$$

The components of $T'(u,v,w;y)$ are shown in schematic form in Figure 2.


Figure 2
Schematic diagram of the components of four-dimensional transmitted
information, $T'(u,v,w;y)$, with three transmitters and a single receiver.

If  it  happens  that 

$$n_{hijk} = n_{ijk}/W,$$

where W is the number of classes in w, all the components of $C'(u,v,w,y)$ that
are functions of w drop out and $C'(u,v,w,y) = C'(u,v,y)$. In similar fashion,
$C'(u,v,y)$ can be reduced to $C'(u,y)$. This is precisely what we did in the
analysis of independence in three-dimensional transmitted information. Since
$C'(u,y) = T'(u;y)$, we see that all cases of transmission with multivariate
inputs can be related to the bivariate case.

With three inputs controlled, we are ready to extend the analysis of
response information in section 4 a step further. We have

$$H'(y) = H'_{uvw}(y) + T'(u,v,w;y). \tag{31}$$

Equation (31) says that we can measure the effects in response information
due to the three inputs. This is evident from the fact that (30) tells us how
to expand $T'(u,v,w;y)$ in its components. In addition we know that

$$H'_{uv}(y) = H'_{uvw}(y) + T'_{uv}(w;y), \tag{32}$$

where

$$T'_{uv}(w;y) = T'(w;y) + A'(uwy) + A'(vwy) + A'(uvwy). \tag{33}$$

We see that controlling w in addition to u and v enables us to rescue the
information transmitted between w and y from the noise, and to replace
$H'_{uv}(y)$ with a better estimate of noise information, namely $H'_{uvw}(y)$.

The transition to an N-dimensional input is now evident. In general,
we have

$$H'(y) = H'_{uvw\cdots z}(y) + T'(u,v,w,\ldots,z;y). \tag{34}$$

The (N + 1)-dimensional transmitted information, $T'(u,v,w,\ldots,z;y)$, can
then be expanded in its components in the manner that we have described.

9.  Asymptotic  Distributions 

Miller and Madow (6) have shown that sample information is related
to the likelihood ratio. Following Miller and Madow, we can show that the
large sample distribution of the likelihood ratio may be used to find approximate
distributions for the quantities involved in multivariate transmission.

Consider, for example, three-dimensional sample transmitted-information,
$T'(u,v;y)$. We can test the hypothesis that $T(u,v;y)$ is equal to zero.
This is equivalent to the hypothesis that

$$p(i,j,k) = p(i,j)\cdot p(k), \tag{35}$$

since $T(u,v;y)$ is zero when input and output are independent. This hypothesis
leads to the likelihood ratio [see reference (7)],

$$\lambda = \frac{n^{-n}\prod_{i,j}(n_{ij})^{n_{ij}}\prod_{k}(n_k)^{n_k}}{\prod_{i,j,k}(n_{ijk})^{n_{ijk}}}. \tag{36}$$

If we take logs, we obtain

$$\frac{-2\log_e\lambda}{1.3863\,n} = s - s_{ij} - s_k + s_{ijk}, \tag{37}$$

$$-2\log_e\lambda = 1.3863\,n\,T'(u,v;y).$$

For large samples, $-2\log_e\lambda$ has approximately a $\chi^2$ distribution with
$(UV - 1)(Y - 1)$ degrees of freedom when the null hypothesis (35) is true.
Thus $1.3863\,n\,T'(u,v;y)$ is distributed approximately like $\chi^2$ if $T(u,v;y)$ is
equal to zero.

A  more  important  problem  involves  testing  suspected  information 
sources.  Suppose  in  our  three-dimensional  example,  we  assume  that 

$$p(i,j,k) = p(i)\cdot p(j)\cdot p(k). \tag{38}$$

This hypothesis leads to the likelihood ratio for complete independence in a
three-dimensional contingency table,

$$\lambda = \frac{n^{-2n}\prod_{i}(n_i)^{n_i}\prod_{j}(n_j)^{n_j}\prod_{k}(n_k)^{n_k}}{\prod_{i,j,k}(n_{ijk})^{n_{ijk}}}. \tag{39}$$

After we take logs we find that

$$\begin{aligned}
\frac{-2\log_e\lambda}{1.3863\,n} &= 3s - s_i - s_j - s_k - s + s_{ijk} \\
&= H'(u) + H'(v) + H'(y) - H'(u,v,y), \\
-2\log_e\lambda &= 1.3863\,n\,C'(u,v,y).
\end{aligned} \tag{40}$$

For large samples $-2\log_e\lambda$ has approximately a $\chi^2$ distribution with
$(UVY - 1) - (U - 1) - (V - 1) - (Y - 1)$ degrees of freedom when the
null hypothesis is true.

We also know that

$$C'(u,v,y) = T'(u;y) + T'(v;y) + T'_y(u;v). \tag{41}$$

The likelihood ratio can be used to show that $1.3863\,n\,T'(u;y)$ and
$1.3863\,n\,T'(v;y)$ are asymptotically distributed like $\chi^2$ with $(U - 1)(Y - 1)$
and $(V - 1)(Y - 1)$ degrees of freedom, respectively, if $T(u;y)$ and $T(v;y)$
are zero. To find the asymptotic distribution of $T'_y(u;v)$, we make the following
hypothesis:

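$$p(i,j,k) = p(i,k)\cdot p_k(j), \tag{42}$$

where $p_k(j)$ is the conditional probability of j given k.
Now we have the ratio

$$\lambda = \frac{\prod_{k}(n_k)^{-n_k}\prod_{i,k}(n_{ik})^{n_{ik}}\prod_{j,k}(n_{jk})^{n_{jk}}}{\prod_{i,j,k}(n_{ijk})^{n_{ijk}}}, \tag{43}$$

$$\frac{-2\log_e\lambda}{1.3863\,n} = s_k - s_{ik} - s_{jk} + s_{ijk}, \tag{44}$$

$$-2\log_e\lambda = 1.3863\,n\,T'_y(u;v).$$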

In this case $-2\log_e\lambda$ has $Y(U - 1)(V - 1)$ degrees of freedom. In view
of (41) we can write

$$1.3863\,n\,C'(u,v,y) = 1.3863\,n\,[T'(u;y) + T'(v;y) + T'_y(u;v)]. \tag{45}$$

The quantities on the right side of (45) have degrees of freedom that sum to
$(UVY - U - V - Y + 2)$. Since this is the same number of degrees of
freedom as on the left hand side of (45), the quantities on the right side of
(45) are asymptotically independent, if the null hypothesis,

$$p(i,j,k) = p(i)\cdot p(j)\cdot p(k),$$

is true.
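
The degrees-of-freedom bookkeeping behind (45) can be checked mechanically. The sketch below is illustrative: the sizes U = 4, V = 5, Y = 5 and the sample size n = 125 are assumptions inferred from the degrees of freedom and test statistics reported in Table 3, and are not stated in this section. The function names are hypothetical.

```python
TWO_LN_2 = 1.3863   # 2 * ln 2, the constant used in the text


def degrees_of_freedom(U, V, Y):
    """Degrees of freedom for each term of equation (45)."""
    return {
        "T_uy": (U - 1) * (Y - 1),
        "T_vy": (V - 1) * (Y - 1),
        "Ty_uv": Y * (U - 1) * (V - 1),
        "C_uvy": (U * V * Y - 1) - (U - 1) - (V - 1) - (Y - 1),
    }


def statistic(n, bits):
    """-2 log_e(lambda) for a sample measure of `bits` bits from n observations."""
    return TWO_LN_2 * n * bits


# Assumed sizes (inferred, not stated in this section): U = 4, V = 5, Y = 5, n = 125.
df = degrees_of_freedom(4, 5, 5)

# Sample measures in bits, as reported in the text.
measures = {"T_uy": 0.05780, "T_vy": 0.21840, "Ty_uv": 0.41435, "C_uvy": 0.69055}
chi_sq = {name: statistic(125, bits) for name, bits in measures.items()}

if __name__ == "__main__":
    for name in measures:
        print(f"{name:6s} statistic = {chi_sq[name]:8.3f}  d.f. = {df[name]}")
```

Under these assumptions the three component degrees of freedom add up to the total, which is the condition used above to argue asymptotic independence.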

This means that as an approximation we can test $T'(u;y)$, $T'(v;y)$ and
$T'_y(u;v)$ simultaneously for significance under the null hypothesis we have
stated. The test is very similar to an analysis of variance. We can see the
similarity by applying the test to the data from our example in section 5.
The significance tests will be made on the quantities in equation (45). To do
this we need to compute $C'(u,v,y)$ and $T'_y(u;v)$, since these terms were not
discussed in section 5. First we note that $C'(u,v,y)$ is the total amount of
association in the stimulus × response × presponse table. We have

$$C'(u,v,y) = 2s + s_{ijk} - s_i - s_j - s_k,$$

$$C'(u,v,y) = .69055.$$

We also need $T'_y(u;v)$, the information transmitted from presponses to stimuli
with responses held constant. This measures how successfully the presponses
predict the auditory stimuli. Since stimuli were chosen at random, we do not
expect much transmitted information here. The computation goes as follows:

$$\begin{aligned}
T'_y(u;v) &= s_k - s_{ik} - s_{jk} + s_{ijk}, \\
&= T'(u;v) + A'(uvy), \\
&= .41435.
\end{aligned}$$

We may now put our computed values for $C'(u,v,y)$, $T'(u;y)$, $T'(v;y)$ and
$T'_y(u;v)$ into equation (45) and perform the $\chi^2$ tests. The results are summarized
in Table 3. We have not attempted to calculate the significance level of
$C'(u,v,y)$ because we do not have enough data to sustain the 88 degrees of
freedom. The same criticism can probably be leveled at our test for $T'_y(u;v)$.
In any case Table 3 shows that the only significant effect in the experiment
is the presponse-response association.

TABLE 3

Table of Transmitted Information

Transmission          Component     -2 log_e λ    d.f.    P
Stimulus-Response     T'(u;y)         10.016       12     >.50
Presponse-Response    T'(v;y)         37.844       16     <.01
Presponse-Stimulus    T'_y(u;v)       71.802       60     =.14*
Total                 C'(u,v,y)      119.664       88

One interesting fact that the analysis brings out clearly is that we
cannot decide whether an amount of transmitted information is big or small
without knowing its degrees of freedom. In our example we find that $T'_y(u;v)$ =
.414 bits, while $T'(v;y)$ = .218 bits. Yet $T'(v;y)$ is significant and $T'_y(u;v)$
is not. The reason lies in the difference in degrees of freedom. Miller and
Madow (6) have discussed the amount of statistical bias in information
measures due to degrees of freedom, and have suggested corrections.

In Table 3, we tested $T'_y(u;v)$, the association between presponses and
stimuli with responses held constant. This association is broken down still
further in Table 4. No probability is estimated in Table 4 for the interaction
term, $A'(uvy)$, because its asymptotic distribution is not chi-square. All
A-terms are distributed like the difference of two variables each of which
has the chi-square distribution. The distribution of this difference is evidently
not chi-square because the difference can be negative. Its density function
has been derived by Pearson, Stouffer, and David (9), but the writer has
been unable to find a table of the integral. In some cases the problem can
be circumvented by combining A-terms with T-terms to make new T-terms.
[See, for example, equation (33).] However, in other cases, the interactions
are genuinely interesting in their own right and should be tested directly.
These cases can be treated when adequate tables become available.

TABLE 4

Table of Transmitted Information

Transmission          Component     -2 log_e λ    d.f.    P
Presponse-Stimulus    T'(u;v)         20.853       12     >.05
Interaction           A'(uvy)         50.948              **
Total                 T'_y(u;v)       71.802       60     =.14*

**  Probability not estimated.


RANDOM FLUCTUATIONS OF RESPONSE RATE*
William  J.  McGill 

COLUMBIA  UNIVERSITY 

A  simple  model  for  fluctuating  interresponse  times  is  developed  and 
studied.  It  involves  a  mechanism  that  generates  regularly  spaced  excitations, 
each  of  which  can  trigger  off  a  response  after  a  random  delay.  The  excitations 
are  not  observable,  but  their  periodicity  is  reflected  in  a  regular  patterning 
of  responses.  The  probability  distribution  of  the  time  between  responses  is 
derived  and  its  properties  are  analyzed.  Several  limiting  cases  are  also 
examined. 

A number of behavioral systems generate sequences of pulse-like
responses that recur regularly in time. The constant beating of the heart is
an illustration that springs immediately to mind. Another example is the
optic nerve of the horseshoe crab, Limulus, which is famous for the long
trains of precisely timed action potentials it produces when its visual receptor
is illuminated by a steady light [6]. Response sequences with comparable
periodicity are also found in studies of operant conditioning when the rate
of occurrence of a response is stabilized by reinforcing paced responding
(see [5], pp. 498-502). The essential point to bear in mind in each of these
examples is the fact that the sequence of intervals between responses is not
random. The intervals resemble the ticking of a watch more than the
irregular fluctuations of a stream of electrons. Consequently the Poisson
distribution, which is sometimes proposed [11] to deal with rate fluctuations,
is not likely to be very helpful.

Under  close  scrutiny  the  timing  of  many  of  these  periodic  response 
systems  reveals  itself  to  be  less  than  perfect,  and  the  intervals  between 
responses  are  seen  to  change  in  small  amounts.  The  distribution  of  these 
changes  is  what  interests  us.  We  want  to  construct  a  model  with  both  periodic 
and  random  components.  Intervals  generated  by  the  model  will  then  be 
more  stable  than  a  purely  random  sequence,  but  less  stable  than  a  completely 
periodic  system.  Moreover  they  will  have  the  capacity  to  take  on  any  one 
of  the  wide  range  of  possibilities  between  these  extremes.  Since  this  seems 
to  be  a  natural  extension  of  the  type  of  randomness  found  in  physical  systems 
to  the  more  orderly  behavior  of  biological  processes,  it  is  surprising  that  the 
problem has received so little attention.

This paper is an attempt to examine the properties of an elementary
mechanism for producing noisy fluctuations in otherwise constant time
intervals. Despite its simplicity, the mechanism can duplicate a variety of
observed phenomena, ranging from sharply peaked and symmetrical
distributions of interresponse times to highly skewed distributions, and even
completely random responding. Moreover, all these behaviors can be elicited
from the same mechanism by altering the rate at which it is excited.

*This paper was completed while the writer was a visiting summer scientist at the
Lincoln Laboratory, Lexington, Mass.

This article appeared in Psychometrika, 1962, 27, 3-17. Reprinted with permission.

Periodic  Excitation 

We  begin  by  examining  interresponse  times  that  are  nearly  constant. 
The  key  to  this  regularity,  we  assume,  is  some  sort  of  periodic  excitatory 
process  that  triggers  a  response  after  a  short  random  delay.  Even  when  the 
excitations  are  not  observable  their  effects  are  seen  in  the  regular  intervals 
they  impose  between  responses.  The  periodic  mechanism  proposed  here  is 
diagrammed in Fig. 1, which also illustrates our notation.

Figure 1
Stochastic latency mechanism yielding variable interresponse times with a periodic
component. Excitations (not observable) come at regular intervals $\tau$, but are subject
to random delays before producing responses. Heavy line is the time axis.


E and R denote excitation and response respectively. The time interval
between two successive responses is a random variable and is called t. The
analogous interval (or period) between excitations is a fixed (unknown)
constant $\tau$. Excitation and response almost never coincide in time.
Consequently a response will almost always be located between two excitations,
and its distance from each excitation can be expressed as two location
coordinates. The first of these, r, is the delay from a response to the next
following excitation. The second, s, is the corresponding interval between a
response and the excitation that immediately precedes it.

The  basic  random  quantity  in  Fig.  1  is  s,  and  our  problem  is  to  deduce 
the  distribution  of  t  when  the  distribution  of  s  is  known.  Accordingly,  suppose 
that s has an exponential distribution as would be the case if interresponse
times were completely random. Let

$$f(s) = \lambda e^{-\lambda s}, \tag{1}$$

where $f(s)$ is the frequency function of s, and $\lambda$ is a positive constant, i.e.,
the time constant. Equation (1) then describes a very simple delay process
in which the probability of a response during any short interval of time $\Delta s$
following excitation is constant and equal to $\lambda\,\Delta s$ (see Feller [4], p. 220).
This  defines  what  we  mean  by  "completely"  random;  the  instantaneous 
probability  of  response  is  independent  of  time. 

We  are  in  trouble  immediately,  for  (1)  is  not  strictly  legitimate  in  view 
of  the  requirements  just  set  down.  This  may  be  seen  from  the  fact  that 
the exponential is distributed on the interval $0 < s < \infty$, whereas the
maximum value of s in Fig. 1 is $\tau$.

If $\lambda\tau$ is sufficiently large, no real trouble is encountered because, in this
circumstance, the average delay between excitation and response is small
compared with the period between excitations. Hence the response to $E_1$
is practically certain to occur before $E_2$ comes along, and the tail of the
distribution of s never really gets tangled with the next following excitation.
When it happens that $\lambda\tau$ is not large, a simple adjustment of (1) is required
in order to bound s between zero and $\tau$, without changing its
characterization as a completely random interval.

Distribution  of  Interresponse  Times  With  a  Periodic  Component 

Our main results are given in (2) and (3), which describe the probability
distribution of the mechanism outlined in the first section and pictured in
Fig. 1. The density function describing the distribution of the time interval t
between two successive responses is

$$f(t) = \begin{cases}
\dfrac{\lambda\nu}{1-\nu}\,\sinh\lambda t, & t \le \tau, \\[2ex]
\dfrac{1+\nu}{2\nu}\,\lambda e^{-\lambda t}, & t \ge \tau,
\end{cases} \tag{2}$$

in which $\nu$ is a constant given by $\nu = e^{-\lambda\tau}$. The distribution is evidently
skewed and has a well-defined maximum over $t = \tau$.
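
As a check on the form of (2), the density can be evaluated numerically. The sketch below is illustrative only: the parameter values and the function names are assumptions chosen for the example, and the integral is approximated with a simple midpoint rule rather than evaluated in closed form.

```python
from math import exp, sinh


def interresponse_density(t, lam, tau):
    """Density (2) of the interresponse time t, for rate lam and period tau."""
    nu = exp(-lam * tau)
    if t < 0:
        return 0.0
    if t <= tau:
        return lam * nu / (1.0 - nu) * sinh(lam * t)
    return (1.0 + nu) / (2.0 * nu) * lam * exp(-lam * t)


def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


lam, tau = 1.5, 2.0      # illustrative values, not taken from the paper


def f(t):
    return interresponse_density(t, lam, tau)


# Total area: the rising branch up to tau plus the exponential tail
# (truncated far enough out that the remainder is negligible).
area = integrate(f, 0.0, tau) + integrate(f, tau, tau + 40.0 / lam)

# The two branches should meet at t = tau, where the density peaks.
jump = abs(f(tau) - f(tau + 1e-9))
peak_is_at_tau = all(f(tau) >= f(tau + d) for d in (-0.5, -0.1, 0.1, 0.5))
```

For these values the area is 1 to within the accuracy of the numerical integration, and the two branches join continuously at the maximum.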

Whenever $\lambda\tau$ happens to be large enough so that $\nu$ is negligibly small,
the distribution of interresponse time in (2) simplifies to

$$f(t - \tau) = \frac{\lambda}{2}\,e^{-\lambda|t-\tau|}, \qquad -\infty < t - \tau < \infty. \tag{3}$$

Equation (3) is the well-known Laplace density function [1]. It is symmetrical
and sharply peaked over $t = \tau$, and describes the behavior of the latency
mechanism when the intervals between successive responses are dominated
by the periodic component $\tau$. "Noise" introduced by the random component
must then be small in comparison with the periodicity generated by the
excitatory process.

The approximation in (3) is easily rationalized if $1/\lambda$ is considered as
measuring the magnitude of the random component. In that case $1/\lambda\tau$
measures the size of the noisy perturbation relative to the period between
excitations. Hence the parameter $\nu$ will go toward zero whenever the ratio
$1/\lambda\tau$ gets small, i.e., whenever the random component is effectively small.
It is not obvious that (2) approaches the Laplace distribution as $\nu$ disappears,
but a brief study of (2) shows that this is in fact what happens.
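
The limiting step can be made explicit. As a sketch of the brief study mentioned above, substitute $\nu = e^{-\lambda\tau}$ into the two branches of (2), where t is the interresponse time, $\tau$ the period and $\lambda$ the rate constant:

```latex
\begin{aligned}
t \le \tau:&\quad \frac{\lambda\nu}{1-\nu}\,\sinh\lambda t
  = \frac{\lambda}{2(1-\nu)}\left[e^{-\lambda(\tau - t)} - \nu\,e^{-\lambda t}\right], \\
t \ge \tau:&\quad \frac{1+\nu}{2\nu}\,\lambda e^{-\lambda t}
  = \frac{(1+\nu)\,\lambda}{2}\,e^{-\lambda(t-\tau)}.
\end{aligned}
```

As $\nu \to 0$ with $t - \tau$ held fixed, both lines tend to $(\lambda/2)\,e^{-\lambda|t-\tau|}$, which is the Laplace density (3).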

Proof  of  the  Distribution* 

We  shall  now  show  that  (2)  is  the  correct  form  of  the  distribution  of 
interresponse  times  when  responses  are  triggered  by  periodic  excitations  as 
shown  in  Fig.  1. 

First of all, (1) must be adjusted to hold s between zero and $\tau$. This
is easily handled. We begin with an excitation and simply cycle the exponential
distribution back to the origin as soon as s reaches $\tau$, letting the distribution
continue to run down until it reaches $\tau$ again, and repeating the process
ad infinitum. The ordinate corresponding to any point s between zero and
$\tau$ will then be given by

$$\sum_{k=0}^{\infty} \lambda e^{-\lambda(s + k\tau)} = \lambda e^{-\lambda s}\,(1 + \nu + \nu^2 + \cdots).$$

Consequently the position of the response in the interval between excitations
will be distributed as

$$f(s) = \frac{\lambda e^{-\lambda s}}{1-\nu}, \qquad 0 < s < \tau. \tag{1a}$$

The distribution of r, the interval from the response to the next following
excitation, is now determined, since, from Fig. 1, $s = \tau - r$. Substituting
in (1a) yields

$$f(r) = \frac{\lambda\nu\,e^{\lambda r}}{1-\nu}, \qquad 0 < r < \tau. \tag{4}$$

Evidently  r  and  s  are  perfectly  (and  inversely)  correlated  in  the  same 
excitation  period.  On  the  other  hand  only  one  response  can  occur  between 
two  excitations.  Hence,  when  intervals  between  responses  are  analyzed,  r 
will  belong  to  one  excitation  period  and  s  will  belong  to  a  later  one,  thus 
making  r  and  s  independent  for  determining  t. 

It  should  be  clear  that  t  is  not  just  the  sum  of  r  and  s  although  a  cursory 

*The  writer  is  indebted  to  a  referee  for  suggesting  several  excellent  ways  to  simplify 
the  original  proof. 
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examination  of  Fig.  1  leaves  that  impression.  The  trouble  with  the  impression 
is  that  several  excitation  periods  may  separate  Ri  from  R2  .  In  other  words, 
responses  are  not  forced  by  excitations.  A  new  excitation  may  come  along 
before  the  response  is  emitted.  We  have  drawn  Fig.  1  as  though  response  R2 
fell  into  the  excitation  period  following  Ri  ,  but  a  moment's  reflection  suggests 
that  things  might  not  happen  so  neatly.  To  deal  with  this  nasty  eventuality, 
we  shall  define  t  as 

(5)  $$t = k\tau + r + s,$$

where $r$ is taken as the time interval between $R_1$ and the next following
excitation, $s$ is the analogous interval between $R_2$ and the excitation
immediately preceding it, and $k$ is the number of periods in which no response
occurs, i.e., the number of empty excitation periods between $R_1$ and $R_2$.

Equation (5) is now a unique specification of $t$ in terms of quantities
whose distributions are known as soon as $\lambda$ and $\tau$ are fixed. The
distributions of $r$ and $s$ have already been specified in (4) and (1a). Our
next step is to find the distribution of $k\tau$.

An interval beginning with an excitation and terminating in a response
may span several excitations before the response occurs. This latency is
denoted by $k\tau + s$, where $k$ takes on values 0, 1, 2, 3, etc. It is evident
from the arguments leading up to (1a) that $k\tau + s$ has the exponential
distribution out to infinity. Accordingly, the delay from an excitation to the
first subsequent response can be resolved into two independent components:
(i) the number of excitation periods passed, and (ii) the location of response
$R_2$ in the period between the last two excitations. In view of the independence
of $k$ and $s$, we can write

$$\lambda e^{-\lambda(k\tau + s)} = P(k\tau)\,\frac{\lambda e^{-\lambda s}}{1 - \nu},$$

where $P(k\tau)$ is the probability of a particular value of $k\tau$. We find that

(6)  $$P(k\tau) = \nu^{k}(1 - \nu).$$

In other words, the distribution of $k\tau$ is geometric with ordinates spaced out
at successive multiples of $\tau$.

All three components of (5) are independent. Moreover, the variables
$r$ and $s$ form a unit that is the same for each value of $k$. Consequently, (5)
can be amended to read

(5a)  $$t = k\tau + y,$$

where $0 \le y = r + s \le 2\tau$, and $k\tau$ has the geometric distribution given
by (6). The distribution of $y$ is obtained from the convolution of $r$ and $s$.
After some simplification we find that

(7)  $$f(y) = \begin{cases} c \sinh \lambda y, & 0 \le y \le \tau, \\ c \sinh \lambda(2\tau - y), & \tau \le y \le 2\tau, \end{cases}$$

where

$$c = \frac{\lambda\nu}{(1 - \nu)^{2}}.$$

The distribution of $t$ will depend on the number of excitations between
the pair of responses that bound each interval. This number fixes $k$, and it
follows that each change in $k$ will define a new component of the distribution
of $t$. It will be convenient to describe each component separately by linking
it to the number of excitations in the interval. The density function of the
$k$th harmonic component of $f(t)$ will be indicated as $f_k(y)$, since $k$ and $y$
determine $t$. Equations (5a) and (6) yield

(8)  $$f_k(y) = \nu^{k}(1 - \nu)\,f(y).$$

For example, $f_0(y)$ is the density function of interresponse times with a
single excitation between each pair of responses. This component of $f(t)$
has $k$ equal to zero and is defined over the interval $0 \le t \le 2\tau$. The average
interresponse time in the interval is $\tau$.

The first harmonic component $f_1(y)$ refers to interresponse times with
just two excitations between each pair of responses. Hence $k$ is unity and
$f_1(y)$ spans values of $t$ between $\tau$ and $3\tau$. The average interresponse time
is $2\tau$. Higher harmonic components are defined in the same way.

The foregoing makes it evident that for values of $t \ge \tau$ the density
function has contributions from two harmonic components in each interval
corresponding to the length of an excitation period. The pair of contributors
will change as we proceed away from the origin in multiples of $\tau$, but every
element of density in $f(t)$ after $t = \tau$ will turn out to have two components.
Specifically,

(9)  $$f(t) = f_k(y) + f_{k+1}(y - \tau), \qquad \tau \le y \le 2\tau.$$

If the densities on the right-hand side of (9) are replaced by equivalent
expressions determined from (8), it is easily shown that

(9a)  $$f(t) = \frac{1 + \nu}{2\nu}\,\lambda e^{-\lambda(k\tau + y)}.$$

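Now recall that $k\tau + y$ is simply another way of writing $t$, and it is apparent
that the harmonic components of $f(t)$ interlace themselves in a way that
produces a surprisingly simple expression for the distribution of $t$:

$$f(t) = \begin{cases} \dfrac{\lambda\nu}{1 - \nu}\,\sinh \lambda t, & 0 \le t \le \tau, \\[1ex] \dfrac{1 + \nu}{2\nu}\,\lambda e^{-\lambda t}, & t \ge \tau. \end{cases}$$

This is (2) and the proof is complete.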
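
As a rough numerical check on (2), the sketch below codes the density and also simulates the mechanism of Fig. 1 directly. It is an illustration only: the function names are hypothetical, $\nu$ is assumed to equal $e^{-\lambda\tau}$ (the value implied by the geometric distribution in (6)), and the values $\lambda = \tau = 1$ are those quoted for Fig. 2. Under these assumptions the area under the density should be close to one, and the simulated mean should be close to $\tau/(1 - \nu)$, which follows from (5) and (6).

```python
import math
import random


def irt_density(t, lam, tau):
    """Density (2) of an interresponse time t; lam is the delay rate, tau the period."""
    nu = math.exp(-lam * tau)  # assumed meaning of the constant nu
    if t < 0.0:
        return 0.0
    if t <= tau:
        return lam * nu / (1.0 - nu) * math.sinh(lam * t)
    return (1.0 + nu) / (2.0 * nu) * lam * math.exp(-lam * t)


def integrate(func, a, b, steps):
    """Midpoint-rule integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))


def simulate_irts(n, lam, tau, seed=0):
    """Simulate n interresponse times t = r + (k*tau + s) for periodic excitation."""
    rng = random.Random(seed)
    # Position of the first response inside its excitation period.
    s = math.fmod(rng.expovariate(lam), tau)
    times = []
    for _ in range(n):
        r = tau - s                    # wait from the response to the next excitation
        delay = rng.expovariate(lam)   # exponential latency k*tau + s of the next response
        times.append(r + delay)
        s = math.fmod(delay, tau)      # position of that response inside its period
    return times


if __name__ == "__main__":
    lam, tau = 1.0, 1.0
    nu = math.exp(-lam * tau)
    area = integrate(lambda t: irt_density(t, lam, tau), 0.0, 40.0, 40000)
    sample = simulate_irts(200000, lam, tau, seed=1)
    print("area under (2):", area)
    print("simulated mean:", sum(sample) / len(sample))
    print("tau / (1 - nu):", tau / (1.0 - nu))
```

The simulation never uses (2) itself, so agreement between the two is a check on the derivation rather than a restatement of it.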

Once the answer is known, a simpler proof can be established via a
moment generating function. Table 1 gives the moment generating functions
for $r$, $s$, and $k\tau$, all of which are easy to work out. The theorem governing
moment generating functions for sums of random variables (see Hoel [8],
or Mood [10]; our notation follows Hoel) allows us to write

(10)  $$M_t(\theta) = M_{k\tau}(\theta)\,M_r(\theta)\,M_s(\theta),$$

where $M_t(\theta)$ is defined as

(11)  $$M_t(\theta) = \int_0^{\infty} e^{\theta t} f(t)\,dt.$$

The generating functions in Table 1 are now substituted for the corresponding
terms on the right-hand side of (10), and we obtain

(12)  $$M_t(\theta) = \frac{1}{1 - (\theta/\lambda)^{2}} \cdot \frac{e^{\theta\tau} - \nu}{1 - \nu}.$$

This is the moment generating function of the distribution of interresponse
times. It happens that it is also the m.g.f. of (2), a fact that is easily
demonstrated by substituting (2) for $f(t)$ in (11). The properties of the Laplace
transform assure that (2) will be the only continuous distribution having
the required m.g.f. [12, 13].
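
Table 1 is not reproduced in this passage, so the following worked step is offered only as a sketch of how the three factors in (10) combine. The forms below are not copied from the table; they are obtained by integrating (1a) and (4) over one period and summing (6), with $\nu$ assumed equal to $e^{-\lambda\tau}$ and $\theta < \lambda$.

```latex
\begin{align*}
M_s(\theta) &= \int_0^{\tau} e^{\theta s}\,\frac{\lambda e^{-\lambda s}}{1-\nu}\,ds
             = \frac{1 - \nu e^{\theta\tau}}{(1-\nu)(1 - \theta/\lambda)}, \\
M_r(\theta) &= \int_0^{\tau} e^{\theta r}\,\frac{\lambda e^{-\lambda(\tau - r)}}{1-\nu}\,dr
             = \frac{e^{\theta\tau} - \nu}{(1-\nu)(1 + \theta/\lambda)}, \\
M_{k\tau}(\theta) &= \sum_{k=0}^{\infty} e^{\theta k\tau}\,\nu^{k}(1-\nu)
             = \frac{1-\nu}{1 - \nu e^{\theta\tau}}, \\
M_{k\tau}(\theta)\,M_r(\theta)\,M_s(\theta)
            &= \frac{1}{1 - (\theta/\lambda)^{2}}\cdot\frac{e^{\theta\tau} - \nu}{1-\nu}.
\end{align*}
```

The factor $1 - \nu e^{\theta\tau}$ cancels between $M_s$ and $M_{k\tau}$, which is why the product is so much simpler than its parts.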

The  Laplace  Distribution 

An interesting limiting case of (2) occurs when the distribution of
interresponse times is dominated by a strong periodic component. The net effect
of this restriction is that (2) is transformed into the Laplace distribution.
Consequently the Laplace distribution characterizes the "noise" in a class
of simple timing devices. The essential feature of these devices is that they
are self-compensating. Intervals that are too long tend to be followed
immediately by intervals that are too short and vice versa. (The correlation
between adjacent interresponse times for the mechanism pictured in Fig. 1
is -.50.) This type of regulation is really what enables us to infer that regular
excitations must be occurring.
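
The self-compensation can be illustrated with a small simulation. The sketch below is not the author's procedure; it assumes the limiting case in which no excitation period is ever empty, so that each interresponse time is the period plus the difference of two successive exponential latencies. The function names and parameter values are hypothetical.

```python
import random


def laplace_limit_irts(n, lam, tau, seed=0):
    """Interresponse times t_i = tau + s_(i+1) - s_i when no period is empty."""
    rng = random.Random(seed)
    # One exponential latency per response. Each latency enters two
    # neighbouring intervals with opposite signs, which produces the
    # negative correlation between adjacent intervals.
    latencies = [rng.expovariate(lam) for _ in range(n + 1)]
    return [tau + latencies[i + 1] - latencies[i] for i in range(n)]


def lag1_correlation(xs):
    """Sample correlation between each value and its successor."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1)) / (n - 1)
    return cov / var


if __name__ == "__main__":
    irts = laplace_limit_irts(100000, lam=1.0, tau=10.0, seed=2)
    print("lag-1 correlation:", lag1_correlation(irts))
```

With a period that is long relative to the mean latency, the printed value should lie near -.50, since adjacent intervals share one latency and each interval has twice its variance.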

The approach of (2) to the Laplace distribution is easily shown via its
moment generating function. Consider (12) when $\nu$ goes to zero. We have
immediately

(13)  $$M_{t-\tau}(\theta) = \frac{1}{1 - (\theta/\lambda)^{2}}.$$

This is the m.g.f. of the Laplace distribution. More specifically, (3) has (13)
as its m.g.f. The proof may be established by substituting (3) for $f(t)$ in (11).

The restriction $\nu = 0$, which leads from (2) to the Laplace distribution,
implies that $k$, the number of empty excitation periods, must always be
zero. This follows from the fact that the geometric distribution in (6)
collapses when $\nu = 0$. Hence $t$ in Fig. 1 will be just precisely the sum of $r$
and $s$, and we can ignore the possibility of empty excitation periods. Two
responses are necessary to define $t$. Hence there must also be two independent
occurrences of $s$. Call them $s_1$ and $s_2$, corresponding to $R_1$ and $R_2$,
respectively. Refer now to Fig. 1 and observe that

$$t = r + s_2, \qquad \tau = r + s_1, \qquad t - \tau = s_2 - s_1.$$

Consequently, $t - \tau$ is distributed as the difference of two exponential
variables and we can write its moment generating function as

$$M_{t-\tau}(\theta) = M_s(\theta)\,M_s(-\theta).$$

The m.g.f. of the exponential distribution is, of course, very familiar and is
given in Table 1 for the variable $k\tau + s$. Substituting this exponential m.g.f.
for $M_s(\theta)$, we obtain

$$[M_{t-\tau}(\theta)]^{-1} = (1 - \theta/\lambda)(1 + \theta/\lambda) = 1 - (\theta/\lambda)^{2},$$

which is, as we have already shown, the m.g.f. of the Laplace distribution.
Evidently the Laplace density function (3) is in fact simply the distribution
of the difference between two exponential variables. This simple point is
ignored in most texts on statistics because, perhaps, no one imagines why
anyone else would be interested. Our argument establishes a very good
reason for being interested. The difference, and hence the Laplace
distribution, provides a characterization of the error in a timing device that is
under periodic excitation.
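
The identity between the Laplace distribution and the difference of two exponential variables can be checked by sampling. The sketch below is a hypothetical illustration, not part of the original argument. It uses the standard Laplace density $(\lambda/2)e^{-\lambda|x|}$, whose m.g.f. is (13), and compares three of its known properties with the simulated differences: variance $2/\lambda^2$, kurtosis 6, and a proportion $1 - e^{-1}$ of values within $1/\lambda$ of the center.

```python
import math
import random


def exponential_differences(n, lam, seed=0):
    """Differences s2 - s1 of independent exponential variables with rate lam."""
    rng = random.Random(seed)
    return [rng.expovariate(lam) - rng.expovariate(lam) for _ in range(n)]


def laplace_density(x, lam):
    """Standard Laplace density (lam / 2) * exp(-lam * |x|)."""
    return 0.5 * lam * math.exp(-lam * abs(x))


def moments(xs):
    """Return the sample mean, variance, and kurtosis (fourth moment / variance squared)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return mean, m2, m4 / m2 ** 2


if __name__ == "__main__":
    lam = 1.0
    diffs = exponential_differences(200000, lam, seed=3)
    mean, var, kurt = moments(diffs)
    central = sum(1 for d in diffs if abs(d) < 1.0 / lam) / len(diffs)
    print("variance:", var, "expected:", 2.0 / lam ** 2)
    print("kurtosis:", kurt, "expected: 6 (a normal curve gives 3)")
    print("share within 1/lam of center:", central, "expected:", 1.0 - math.exp(-1.0))
```

A kurtosis well above the normal value of 3 is the same sharp peaking that the fitted curves in the Applications section are meant to show.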

Continuous  Excitation 

The latency mechanism also behaves in an interesting way as the period
between excitations gets very small. We now suppose that the delay part
of the mechanism has a fairly slow response, but is bombarded by excitations
following one another in rapid succession. The restriction is achieved
symbolically by fixing the delay time constant, $\lambda$, while allowing $\tau$ to
approach zero. Equation (2) for $f(t)$ immediately leads to:

(14)  $$\lim_{\tau \to 0} f(t) = \lambda e^{-\lambda t}.$$

The result is almost obvious. The portion of $f(t)$ between $t = 0$ and $t = \tau$
must disappear as $\tau$ approaches zero. Meanwhile, the constant $\nu$ is approaching
unity. Consequently the limit for $f(t)$ falls right out of the portion of (2)
defined for $t \ge \tau$. The same exponential limit can be obtained by studying
the behavior of the m.g.f. for $f(t)$ in (12) as $\tau$ approaches zero with $\lambda$ fixed,
or by analyzing the variance of $f(t)$ under these same limiting conditions.
Variances are given in Table 1. The formulas establish that the component
within harmonics (i.e., the variance of $y$) disappears as $\tau$ vanishes, and the
entire variance becomes concentrated in the differences between harmonics.
This implies that the probability distribution of $f(t)$ must congeal around
its harmonic peaks (see Fig. 2) when $\tau$ goes to zero, and that each peak then
contributes a "line" of density to the resulting exponential distribution.
Intuitively, the limit in (14) means that no delay can be contributed by
the latency between a response and the next excitation. That excitation
is instantly available. Hence $r$ in Fig. 1 vanishes and the entire interval
is consumed by the latency between excitation and response, which we have
assumed to be exponential.

Figure 2
General distribution of interresponse times with arbitrary random and periodic parts.
The curve is a plot of equation (2) in the text, with $\lambda$ and $\tau$ = 1. Dashed lines are harmonic
components of the distribution. (Horizontal axis: interresponse time $t$, in standard units.)
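
The two statements behind the limit in (14) can be made explicit in a short worked step. This is a sketch under the assumption that $\nu = e^{-\lambda\tau}$; it uses only the two branches of (2).

```latex
\begin{align*}
P(t \le \tau) &= \int_0^{\tau} \frac{\lambda\nu}{1-\nu}\,\sinh \lambda t\,dt
   = \frac{\nu(\cosh \lambda\tau - 1)}{1-\nu}
   = \frac{1-\nu}{2} \longrightarrow 0, \\
\frac{1+\nu}{2\nu} &= \frac{1 + e^{\lambda\tau}}{2} \longrightarrow 1
   \qquad (\tau \to 0,\ \lambda \text{ fixed}).
\end{align*}
```

The first line shows that the probability carried by the portion of $f(t)$ below $\tau$ vanishes, and the second that the coefficient of $\lambda e^{-\lambda t}$ in the portion above $\tau$ tends to one.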

Applications 

Fig. 3 presents a frequency distribution compiled from a long series
of action potentials recorded on a single fiber of the optic nerve of Limulus.
The narrow distribution demonstrates that the data are periodic and the
periodicity seems to originate in the refractory period of the nerve fiber.
The mechanism, however, is not well understood. In this particular case,
the regular sequence of action potentials was achieved by dissecting out a
fiber of the optic nerve, and shining a beam of light on the receptor, i.e.,
the ommatidium, to which the fiber was attached. Under steady illumination,
the nerve fiber produced a barrage of discrete responses which were then
amplified and recorded on magnetic tape. Later on, the tape was played
into the control gate of a digital counter, and time intervals between alternate
pairs of responses were read out onto a permanent record. The timing signal
passed through the gate was a 1000 cps sine wave generated by a calibrated
tuning fork oscillator. Over-all accuracy of the measurements is of the order
of ±2 milliseconds, due to variations in the speed of the tape recorder.*

Figure 3
Frequency distribution of 303 interresponse times observed in a single fiber of the optic
nerve of Limulus when the eye was illuminated by a steady light. The nerve fiber adapted
continuously to the illumination, resulting in a slow linear increase in period from 261 to
291 milliseconds. Measured intervals are deviations from the linear drift. Smooth curve
is a Laplace distribution. (Horizontal axis: deviation from period, in milliseconds.)

The nerve fiber adapted continuously to steady illumination, resulting
in a slow increase in the basic period from about 261 milliseconds to 291
milliseconds. This change was isolated by averaging the data in blocks of
25 intervals, and fitting a line to the averages, which fortunately were quite
linear. Measured intervals between responses were converted into deviations
from this line, yielding the frequency distribution in Fig. 3. The fitted curve
is a Laplace density function.
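
The drift correction described above can be sketched in a few lines. The details are assumptions made for illustration: the text does not say how the line was fitted or how a final partial block was handled, so the sketch uses ordinary least squares on the block means and ignores any leftover intervals when forming the means. All names are hypothetical.

```python
def block_means(values, block_size):
    """Means of consecutive full blocks, paired with each block's center index."""
    points = []
    for start in range(0, len(values) - block_size + 1, block_size):
        block = values[start:start + block_size]
        center = start + (block_size - 1) / 2.0
        points.append((center, sum(block) / block_size))
    return points


def fit_line(points):
    """Least-squares slope and intercept through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def deviations_from_drift(intervals, block_size=25):
    """Subtract the line fitted to the block means from every interval."""
    slope, intercept = fit_line(block_means(intervals, block_size))
    return [v - (intercept + slope * i) for i, v in enumerate(intervals)]


if __name__ == "__main__":
    # Synthetic record: a purely linear drift starting at 261 ms, 303 intervals.
    drifting = [261.0 + 0.1 * i for i in range(303)]
    residuals = deviations_from_drift(drifting)
    print("largest residual:", max(abs(x) for x in residuals))
```

On a purely linear record the residuals are zero apart from rounding, which is a quick way to confirm that the block centers are indexed consistently with the intervals.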

*The  preparation  and  recording  were  made  by  C.  G.  Mueller.  The  data  were  recovered 
and  analyzed  by  the  writer  with  the  assistance  of  Michael  S.  Kennedy. 

A normal curve fitted to the same data would have high shoulders and
a flat top. This fact then defines the distribution of interresponse time as
being leptokurtic. Another illustration is provided in Fig. 4 and is taken
from data reported by Hill [7]. The distribution was obtained by measuring
intervals between successive bar-presses made by a white rat. The data
were taken on the 93rd day of conditioning with a reinforcement schedule
in which payoff was contingent on delaying at least 21 seconds from the
last previous response. The normal approximation to Hill's data is shown
in Fig. 4 by the dashed frequency distribution in the background. This
normal curve was fitted by matching mean and variance to the data.
Responses in the 0-3 second class interval were not used for this purpose because
bursts of responses immediately after reinforcement are believed to be
unrelated to the main effect. In any event the leptokurtic character of Hill's
data is evident, and it suggests that the long regimen of training (184 hours)
on the time discrimination problem made Hill's rat into a fairly accurate
Laplace-type clock. We are led naturally to conjecture about how the rat
constructs $\tau$. Does it happen internally via some type of neurological clock or
externally via a stereotyped sequence of movements?

Figure 4
Distribution of interresponse times produced by a bar-pressing rat after a long period of
conditioning on a schedule in which reinforcement is contingent on delaying at least 21
seconds from last previous response. Dotted curve is best fitting normal approximation
and demonstrates peaking of empirical distribution. (Data from Hill.)

Skewed distributions of interresponse times with the appearance of
(2) (see Fig. 2) are found often in the literature, usually in connection with
high speed responding. Fig. 5 is taken from Brandauer [2] who studied
response sequences generated by a pigeon pecking at a small illuminated
target. Reinforcement was controlled by a high speed flip-flop and the bird
was reinforced whenever a peck happened to coincide in time with a particular
one of the two states of the flip-flop. Consequently, the probability of
reinforcement was determined by the proportion of time the flip-flop spent
in that state, and the net result was that every response had the same (low)
probability of reinforcement. The pigeon generated an average rate of 5.3
responses per second during the run shown in Fig. 5 which covers
approximately 1000 responses. If the sharp peak in Fig. 5 is in fact created by a
periodic excitatory mechanism, we would conclude that excitations were
coming even faster than 5.3 times a second. This follows because the average
length of the interval between responses is increased by the exponential
tail which in turn reflects varying degrees of failure to follow excitation.
In this case it is likely that the period $\tau$ is constructed by a pre-programmed
rhythmic oscillation of the head something like the mechanism that humans
use in order to generate high rates of tapping.
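
The claim that excitations must arrive faster than the observed response rate can be made quantitative with the mean of $t$. The step below is a sketch derived from (5) and (6); it is not taken from Table 1, which is not reproduced here, and it assumes $\nu = e^{-\lambda\tau}$.

```latex
\begin{align*}
E(k\tau) &= \tau \sum_{k=0}^{\infty} k\,\nu^{k}(1-\nu) = \frac{\tau\nu}{1-\nu}, \\
E(r) + E(s) &= \tau \qquad \text{(since } r = \tau - s \text{ within any one period)}, \\
E(t) &= \frac{\tau\nu}{1-\nu} + \tau = \frac{\tau}{1-\nu} \;\ge\; \tau .
\end{align*}
```

The response rate is therefore $(1 - \nu)/\tau$, so an observed rate of 5.3 responses per second corresponds to an excitation rate of $5.3/(1 - \nu)$ per second, which exceeds 5.3 whenever $\nu > 0$.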

Figure 5
Frequency distribution of 1000 interresponse times recorded from a pigeon pecking at a
high rate. Intervals longer than .23 seconds not shown. (Data from Brandauer.)

In a recent paper, Hunt and Kuno [9] present several distributions of
interresponse times recorded during spontaneous activity of single fibers
in the spinal cord of the cat. The data run the gamut from the Laplace to
the exponential, including several examples of what appears to be our skewed
distribution (Fig. 2). The effect is exactly what might be expected, if the
same general response system were subjected to varying rates of periodic
excitation.

Discussion 

It  would  be  hard  to  find  levels  of  behavior  further  apart  than  single 
fiber  activity  and  overt  responding.  Yet  the  distributions  of  interresponse 
times  presented  in  this  paper  seem  applicable  to  both,  and  in  the  limited 
view  afforded  by  a  study  of  the  time  between  responses,  neither  system  looks 
more  complicated  or  better  organized  than  the  other. 

When we find stochastic mechanisms like Fig. 1 operating in overt
responding, it probably means no more than that complicated systems of
neurons can be organized to do very simple jobs. Even so, the noise in an
organization may give a clue to the nature of the organization, and thus
provide a way to study it. When we ask, as we did earlier, how the animal
constructs $\tau$, we have to find a way that is compatible with our conception
of the mechanism as dictated by the noise.

The delineation of simple periodic mechanisms affords similar insights
into information coding in single nerve fibers. Knowledge of the general
form of the coding mechanism indicates what kind of noise higher centers
have to face, and suggests possible ways for detecting periodicity in the
noise. For instance, the Laplace distribution presents very interesting
problems to a device attempting to estimate its parameters [1].

The latency mechanism considered in this paper barely scratches the
surface of the possibilities. It turns out that our mechanism has
indistinguishable excitations. Whenever a new excitation appears before there is a
response to an earlier one, it makes no difference whether the new excitation
replaces the old one and reactivates the response trigger, or is simply blocked by
the excitation that is already working. Once this is clear, other suggestions
for summating excitations or for parallel channeling present themselves.
For example, there are a number of harmonic distributions of interresponse
times in the literature [3]. These distributions show clusterings of
interresponse times at multiples of a fundamental period, and hence seem to
be closely related to (2), which also has harmonic components. But something
else is required, and it is not entirely clear yet what that something else is.

SENSITIVITY TO CHANGES IN THE INTENSITY
OF WHITE NOISE AND ITS RELATION TO
MASKING AND LOUDNESS¹

George A. Miller

Sensitivity  to  changes  in  the  intensity  of  a  random  noise  was  determined  over  a 
wide  range  of  intensities.  The  just  detectable  increment  in  the  intensity  of  the  noise  is  of 
the  same  order  of  magnitude  as  the  just  detectable  increment  in  the  intensity  of  pure 
tones.  For  intensities  more  than  30  db  above  the  threshold  of  hearing  for  noise  the  size 
in  decibels  of  the  increment  which  can  be  heard  50  percent  of  the  time  is  approximately 
constant  (0.41  db).  When  the  results  of  the  experiment  are  regarded  as  measures  of 
the  masking  of  a  noise  by  the  noise  itself,  it  can  be  shown  that  functions  which  describe 
intensity  discrimination  also  describe  the  masking  by  white  noise  of  pure  tones  and  of 
speech.  It  is  argued,  therefore,  that  the  determination  of  differential  sensitivity  to 
intensity  is  a  special  case  of  the  more  general  masking  experiment.  The  loudness  of  the 
noise  was  also  determined,  and  just  noticeable  differences  are  shown  to  be  unequal  in 
subjective  magnitude.  A  just  noticeable  difference  at  a  low  intensity  produces  a  much 
smaller  change  in  the  apparent  loudness  than  does  a  just  noticeable  difference  at  high 
intensity. 

Differential  sensitivity  to  intensity  is  one  of  the  oldest  and  most  important 
problems  in  the  psychophysics  of  audition.  But  previous  experiments  have  concerned 
themselves  mainly  with  sensitivity  to  changes  in  the  intensity  of  sinusoidal  tones,  and 
if  we  want  to  know  the  differential  sensitivity  for  a  complex  sound,  it  is  necessary 
either  to  extrapolate  from  existing  information,  or  actually  to  conduct  the  experiment 
for the sound in question. This gap in our knowledge is due to expediency, not
oversight. The realm of complex sounds includes an infinitude of acoustic compounds, and
experimental  parameters  extend  in  many  directions.  Just  which  of  these  sounds  we 
select  for  investigation  is  an  arbitrary  matter.  Of  the  various  possibilities,  however, 
one  of  the  most  appropriate  is  random  noise,  a  sound  of  persistent  importance  and  one 
which  marks  a  sort  of  ultimate  on  a  scale  of  complexity. 

Although the instantaneous amplitude varies randomly, white noise is perceived
as a steady "hishing" sound, and it is quite possible to determine a listener's sensitivity
to changes in its intensity.² The present paper reports the results of such determinations
for a range of noise intensities.

Apparatus  and  Procedure 

A white-noise voltage, produced by random ionization in a gas tube, was varied
in intensity by shunting the line with known resistances provided by a General Radio
Decade Resistance Box. A schematic diagram of the equipment is shown in Fig. 1.
The attenuators were used to keep constant the values of source and load impedance,
$R_0$ and $R_L$, surrounding the shunt resistances, $R_1$ and $R_2$, since these values must enter
into the computation of the increment which is produced by the insertion of the variable
resistance, $R_2$. The whole system can be represented by the equivalent circuit, also
shown in Fig. 1. For this circuit, the size of an increment in voltage $\Delta E_L$ is given by

$$\frac{\Delta E_L}{E_L} = \frac{R_0 R_2 R_L}{R_1\,[R_0(R_1 + R_2 + R_L) + R_L(R_1 + R_2)]}.$$

Figure 1
Schematic diagram of equipment with the equivalent circuit used in the computation of the
size of the increment in intensity.

Figure 2
Nomogram to convert values of $\Delta P/P$ to $\Delta P$ when $P$ is known.

This article appeared in J. Acoust. Soc. Amer., 1947, 19, 609-619. Reprinted with
permission.

¹ This research was conducted under contract with the U.S. Navy, Office of Naval
Research (Contract N5ori-76, Report PNR-28).

² J. E. Karlin, Auditory tests for the ability to discriminate the pitch and the loudness of
noises, OSRD Report No. 5294 (Psycho-Acoustic Laboratory, Harvard University, August 1,
1945) (available through the Office of Technical Services, U.S. Department of Commerce,
Washington, D.C.).

If the system does not introduce amplitude distortion after the increments are
produced, the increment in sound pressure, expressed in decibels, can be taken as
$20 \log_{10}(1 + \Delta E_L/E_L)$.
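
The computation of the increment can be sketched as follows. Because Fig. 1 is not reproduced here, the circuit is an assumption: a source behind $R_0$, a load $R_L$, and a shunt across the load whose resistance rises from $R_1$ to $R_1 + R_2$ when the variable resistance is inserted. Under that assumption the closed-form ratio below agrees with a direct voltage-divider calculation. The function names and the resistance values in the usage lines are hypothetical.

```python
import math


def relative_voltage_increment(r0, r1, r2, rl):
    """Closed-form Delta E_L / E_L for the assumed shunt circuit."""
    return (r0 * r2 * rl) / (r1 * (r0 * (r1 + r2 + rl) + rl * (r1 + r2)))


def load_voltage(e_source, r0, shunt, rl):
    """Voltage across the load when a shunt resistance sits in parallel with it."""
    parallel = shunt * rl / (shunt + rl)
    return e_source * parallel / (r0 + parallel)


def increment_db(relative_increment):
    """Increment in decibels, 20 * log10(1 + Delta E_L / E_L)."""
    return 20.0 * math.log10(1.0 + relative_increment)


if __name__ == "__main__":
    r0, r1, r2, rl = 600.0, 100.0, 5.0, 600.0  # illustrative values in ohms
    ratio = relative_voltage_increment(r0, r1, r2, rl)
    direct = load_voltage(1.0, r0, r1 + r2, rl) / load_voltage(1.0, r0, r1, rl) - 1.0
    print("closed form:", ratio, "direct:", direct)
    print("increment in db:", increment_db(ratio))
```

Comparing the closed form with the direct divider calculation is a useful guard against a misread subscript in the resistance formula.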

Throughout the following discussion the intensity of the noise will be stated in
terms of its sensation level, that is, the number of decibels above the listener's absolute
threshold for the noise. If the sound-pressure level of the noise is taken to be the level
generated by a moving-coil earphone (Permoflux PDR-10) when the voltage across the
earphone (measured by a thermocouple) is the same as the voltage required for a
sinusoidal wave (1000 cycles) to generate the given sound pressure in a volume of 6 cc, then
the absolute threshold for the noise corresponds to a sound pressure of approximately
10 db re 0.0002 dyne/cm². Thus the sensation level can be converted into
sound-pressure level by the simple procedure of adding 10 db to the value given for the
sensation level. The spectrum of the noise was relatively uniform (±5 db) between
150 and 7000 c.p.s. The measurement and spectrum of the noise transduced by the
earphone PDR-10 have been discussed in detail by Hawkins.³

Once the sound-pressure level and the relative size of the increment in decibels
are known, the absolute value of the increment can be computed. Those interested in
converting the decibels into dynes/cm² will find the nomogram of Fig. 2 a considerable
convenience. A straight line which passes through a value of $\Delta I$ in decibels on the
left-hand scale, and through a value of the sound pressure on the middle scale, will intersect
the right-hand scale at the appropriate value of $\Delta P$ in dyne/cm². When the stimulus is a
plane progressive sound wave, its acoustic intensity in watts/cm² is proportional to the
square of the pressure: $I = kp^2$.
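
What the nomogram does graphically can also be written as a short calculation. The sketch below is an illustration with hypothetical function names. It uses the reference pressure of 0.0002 dyne/cm² and the 10 db threshold offset quoted in the text, and it treats the increment in decibels as a ratio of pressures, so that $\Delta P = P(10^{\Delta/20} - 1)$.

```python
import math

REFERENCE_PRESSURE = 0.0002  # dyne/cm^2, the reference quoted in the text


def spl_from_sensation_level(sensation_level_db, threshold_spl_db=10.0):
    """Sound-pressure level obtained by adding the threshold level to the sensation level."""
    return sensation_level_db + threshold_spl_db


def sound_pressure(spl_db):
    """Sound pressure in dyne/cm^2 for a level in db re 0.0002 dyne/cm^2."""
    return REFERENCE_PRESSURE * 10.0 ** (spl_db / 20.0)


def pressure_increment(spl_db, delta_db):
    """Absolute increment Delta P for an increment of delta_db at the given level."""
    return sound_pressure(spl_db) * (10.0 ** (delta_db / 20.0) - 1.0)


if __name__ == "__main__":
    level = spl_from_sensation_level(30.0)
    print("sound-pressure level:", level, "db")
    print("pressure:", sound_pressure(level), "dyne/cm^2")
    print("increment for 0.41 db:", pressure_increment(level, 0.41), "dyne/cm^2")
```

Because intensity is proportional to the square of the pressure, the same decibel increment corresponds to an intensity ratio of $10^{\Delta/10}$.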

The peak amplitudes in the wave of a white noise are not constant. It is
reasonable to expect, therefore, that the size of the just noticeable difference might vary as a
function of the distribution of peak amplitudes in the wave. In order to evaluate this
aspect of the stimulus, a second experiment was conducted. The noise voltage was
passed through a square-wave generator (Hewlett Packard, Model 210-A) before the
increments were introduced. The spectrum and subjective quality of the noise are not
altered by the square-wave generator, but the peak amplitudes are "squared off" at a
uniform level. The resulting wave form might be described as a square-wave modulated
randomly in frequency.

³ J. E. Hawkins, "The masking of pure tones and of speech by white noise," in a report
entitled The masking of signals by noise, OSRD Report No. 5387 (Psycho-Acoustic Laboratory,
Harvard University, October 1, 1945) (available through the Office of Technical Services,
U.S. Department of Commerce, Washington, D.C.).

The experimental procedure for determining differential sensitivity was the same
as that employed by Stevens, Morgan, and Volkmann.⁴ The only difference was the
omission of a signal light which they sometimes used to indicate the impending
presentation of an increment. The observer, seated alone in a sound-treated room, listened
to the noise monaurally through a high quality, dynamic earphone (PDR-10). The
listener heard a continuous noise, to which an increment was added periodically. A
series of 25 identical increments (1.5 sec. duration at intervals of 4.5 sec.) was presented,
and the percentage heard was tabulated. Four such series were used to determine each
of 5 to 8 points on a psychometric function, and from this function the differential
threshold was obtained by linear interpolation. Thus 500 to 800 judgments by each of
two experienced listeners were used to determine each differential threshold at the 16
different intensities.

TABLE I
Differential Sensitivity for Intensity of Noise. Increments in
Decibels Which Two Listeners Could Hear 50 Percent of the
Time, as a Function of Sensation Level.

Sensation       Random noise        Square-wave noise
level (db)      GM        SM        GM        SM
    3           3.20      3.20
    5           3.00      2.10
   10           1.17      1.17
   12                               0.97      0.89
   15           0.85      0.66
   20           0.49      0.55
   25           0.46      0.54
   32                               0.40      0.39
   35           0.40      0.50
   45           0.42      0.44
   52                               0.40      0.46
   55           0.39      0.50
   70           0.39      0.47
   82                               0.32      0.47
   85           0.33      0.48
  100           0.28      0.40
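
The interpolation step in the procedure described above can be sketched as follows. The text states only that linear interpolation was used, so the details are assumptions: the points are taken in order of increasing increment, and the threshold is read from the first pair of neighboring points that brackets 50 percent. The function name and the sample values are hypothetical and are not data from the experiment.

```python
def threshold_by_interpolation(increments_db, percent_heard, criterion=50.0):
    """Increment at which a piecewise-linear psychometric function crosses the criterion.

    The points must be ordered by increasing increment size.
    """
    points = list(zip(increments_db, percent_heard))
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        # Use the first segment that rises through the criterion.
        if p0 <= criterion <= p1 and p1 > p0:
            return x0 + (criterion - p0) * (x1 - x0) / (p1 - p0)
    raise ValueError("criterion is not bracketed by the data")


if __name__ == "__main__":
    # Four series of 25 increments give 100 judgments per point, so each
    # count of increments heard is already a percentage.
    sizes = [0.2, 0.3, 0.5, 0.7]
    heard = [10.0, 40.0, 60.0, 95.0]
    print("differential threshold (db):", threshold_by_interpolation(sizes, heard))
```

With 100 judgments at each of 5 to 8 points, a function of this kind rests on the 500 to 800 judgments per threshold mentioned in the text.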

Results 

The increments in decibels which the two listeners could hear 50 percent of the
time are presented in Table I as a function of the sensation level of the noise. It will
be noted that the differential sensitivity for "square-wave noise" is not significantly
greater than that for random noise. Apparently the fluctuations in the peak amplitude
of the wave do not influence the size of the just noticeable increment. The response of
the ear is probably too sluggish to follow these brief fluctuations. And since the
difference between the two wave forms is essentially a matter of the phase relations
among the components, we may conclude that these phase relations have no important
effect on differential sensitivity.

Figure 3
Increments in intensity heard 50 percent of the time are plotted as a function of the intensity
of the noise in decibels above the threshold of hearing. Data for tones are presented for
purposes of comparison. The solid line represents Eq. (2).

⁴ S. S. Stevens, C. T. Morgan, and J. Volkmann, Theory of the neural quantum in the
discrimination of loudness and pitch. Am. J. Psychol., 1941, 54, 315-335.

The  data  indicate  that,  for  intensities  30  db  or  more  above  the  absolute  threshold, 
the  relative  differential  threshold  is  approximately  constant.  At  the  highest  intensities 
the  value  is  about  0.41  db,  which  corresponds  to  a  Weber-fraction  of  0.099  for  sound 
energy, or 0.048 for sound pressure. The range over which the increment is proportional
to the level of stimulation is indicated by the horizontal portion of the solid
curve  in  Fig.  3.  The  values  over  this  range  of  intensities  agree  quite  well  with  the  values 
obtained  by  Karlin^  with  a  group  of  50  listeners. 
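
The conversion from a difference limen in decibels to a Weber fraction can be checked with a short calculation. The sketch below assumes only the standard decibel definitions (10 log10 of an energy ratio, 20 log10 of a pressure ratio); the function names are illustrative and do not come from the paper.

```python
def weber_fraction_energy(dl_db):
    """Relative increment in sound energy for a difference limen given in db."""
    return 10 ** (dl_db / 10) - 1


def weber_fraction_pressure(dl_db):
    """Relative increment in sound pressure for the same difference limen."""
    return 10 ** (dl_db / 20) - 1


# The value quoted in the text for the highest intensities is about 0.41 db.
print(round(weber_fraction_energy(0.41), 3))
print(round(weber_fraction_pressure(0.41), 3))
```

With 0.41 db the two functions reproduce, to three decimals, the fractions 0.099 and 0.048 quoted above.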

For purposes of comparison, Fig. 3 includes data obtained by Riesz⁵ and by
Knudsen⁶ for tones. Knudsen's results do not differ markedly from those obtained for
noise, but Riesz's data are quite different, especially at low intensities. Possibly
Knudsen's data represent sensitivity to the "noise" introduced by the abrupt onset of
his tones, or possibly Riesz's data at low intensities are suspect because of his use of
beats to produce increments in intensity. Data obtained by Stevens and Volkmann⁷ for
a single listener at four intensities of a 1000-cycle tone seem to agree more closely with
the present results than with Riesz's, but their data are not complete enough to determine
a function. Churcher, King, and Davies⁸ have reported data with a tone of
800 c.p.s. which compare favorably with the function of Riesz. Taken together, all
these studies indicate that the difference limen for intensity is of the same order of

5 R. R. Riesz, Differential intensity sensitivity of the ear for pure tones, Phys. Rev.,
1928, 31, 867-875.

6 V. O. Knudsen, The sensibility of the ear to small differences in intensity and frequency,
Phys. Rev., 1923, 21, 84-103.

7 S. S. Stevens and J. Volkmann, The quantum of sensory discrimination, Science,
1940, 92, 583-585.

8 B. G. Churcher, A. J. King, and H. Davies, The minimum perceptible change of
intensity of a pure tone, Phil. Mag., 1934, 18, 927-939.
magnitude for noise as it is for tones, at least at the higher levels of intensity.⁹ At the
lower intensities the discrimination for a noise stimulus may be somewhat more acute
than for tones.

Implications for a Quantal Theory of Discrimination

The  notion  that  the  difference  limen  depends  upon  the  activation  of  discrete 
neural units is not new. It is suggested by the discreteness of the sensory cells themselves.
Only recently, however, has evidence been obtained to support the assumption
that  the  basic  neural  processes  mediating  a  discrimination  are  of  an  all-or-none 
character. 

The  principal  evidence  derives  from  the  shape  of  the  psychometric  function. 
Stevens, Morgan, and Volkmann¹⁰ present the argument in the following way:

We  assume  that  the  neural  structures  initially  involved  in  the  perception  of  a 
sensory  continuum  are  divided  into  functionally  distinct  units.  .  .  .  The  stimulus  which 
excites  a  certain  number  of  quanta  will  ordinarily  do  so  with  a  little  to  spare — it  will 
excite  these  quanta  and  leave  a  small  surplus  insufficient  to  excite  some  additional 
quantum. This surplus stimulation will contribute, along with the increment, $\Delta I$, to
bring  into  activity  the  added  quantum  needed  for  discrimination.  .  .  .  How  much  of 
this  left-over  stimulation  or  surplus  excitation  are  we  to  expect?  If  [the  over-all 
fluctuation  in  sensitivity]  is  large  compared  to  the  size  of  an  individual  quantum,  it  is 
evident  that  over  the  course  of  time  all  values  of  the  surplus  stimulation  occur  equally 
often. . . . From these considerations it follows that, if the increment is added instantaneously
to the stimulus, it will be perceived a certain fraction of the time, and this fraction
is directly proportional to the size of the increment itself.

When  the  increments  are  added  to  a  continuous  stimulus,  however,  the  listener 
finds  it  difficult  to  distinguish  one-quantum  changes  in  the  stimulus  from  the  changes 
which  are  constantly  occurring  because  of  fluctuations  in  his  sensitivity.  In  order  to 
make  reliable  judgments,  the  listener  is  forced  to  ignore  all  one-quantum  changes. 
Consequently,  a  stimulus  increment  under  these  conditions  must  activate  at  least  two 
additional  neural  units  in  order  that  a  difference  will  be  perceived  and  reported.  Thus, 
in  effect,  a  constant  error  of  one  quantum  is  added  to  the  psychometric  function. 

The  psychometric  function  predicted  by  this  line  of  reasoning  can  be  described 
in the following way. When the stimulus increments to a steady sound are less than
some value $\Delta I_Q$, they are never reported, and over the range of increments from 0 to
$\Delta I_Q$ the psychometric function remains at 0 percent. Between $\Delta I_Q$ and $2\,\Delta I_Q$ the
proportion of the increments reported varies directly with the size of the increment, and
reaches 100 percent at $2\,\Delta I_Q$. Such a function is illustrated by the solid line of Fig. 4.
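
The rectilinear function just described can be written out directly. The following is a minimal sketch, with the increment expressed in units of the quantal increment; the function name and the default unit are assumptions made for illustration only.

```python
def quantal_psychometric(increment, quantum=1.0):
    """Proportion of increments reported under the two-quantum criterion.

    Zero up to one quantum, rising linearly to 1.0 at two quanta.
    """
    if increment <= quantum:
        return 0.0
    if increment >= 2 * quantum:
        return 1.0
    return (increment - quantum) / quantum


for x in (0.5, 1.0, 1.25, 1.5, 2.0, 2.5):
    print(x, quantal_psychometric(x))
```

On this function the 50 percent point falls at 1.5 quantal increments, which is the value used to align the individual functions in Fig. 4.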

It  will  be  noted  that  the  difference  which  is  reported  50  percent  of  the  time  is 
equivalent  to  1.5  times  the  quantal  increment.  If  we  take  this  value  as  defining  a  unit 
increment  in  the  stimulus,  all  the  psychometric  functions  obtained  for  the  two  listeners 
can  be  combined  into  a  single  function.  In  other  words,  we  can  adjust  the  individual 
intensity  scales  against  which  the  functions  are  plotted  in  order  to  make  all  the  functions 
coincide  at  the  50  percent  point.   In  Fig.  4  the  size  of  the  relative  increment  in  sound 

9 Of the modern investigations, only Dimmick's disagrees strikingly with the values
reported here for the higher intensities. F. L. Dimmick and R. M. Olson, The intensive difference
limen in audition, J. acoust. Soc. Am., 1941, 12, 517-525.

10 See reference 4, p. 317.
Figure 4. The 32 psychometric functions combined in a single graph. Values of $\Delta P/P$ heard 50 percent of the time are designated as 1.5Q, and the datum points on each function are plotted relative to this value. Each point represents 100 judgments. (Abscissa: $\Delta P/P$ adjusted to 1.5Q = j.n.d., marked at 0, 1Q, 2Q, and 3Q.)


pressure, $\Delta P/P$, has been adjusted so that the increment which was heard 50 percent of
the  time  is  plotted  as  1.5  times  the  quantal  increment. 

Figure  4  shows  that  the  characteristic  quantal  function  was  not  obtained  in  this 
experiment.  The  data  are  better  described  by  the  phi-function  of  gamma  (the  normal 
probability  integral)  indicated  by  the  dashed  line. 

The  classical  argument  for  the  application  of  the  cumulative  probability  function 
to  the  difference  limen  assumes  a  number  of  small,  indeterminate  variables  which  are 
independent, and which combine according to chance. When these variables are controlled
or eliminated, the step-wise, "quantal" relation is revealed.¹¹ If this reasoning
is  correct,  then  the  deviations  of  the  points  in  Fig.  4  from  the  quantal  hypothesis  should 
be  attributable  to  the  introduction  of  random  variability  into  the  listening  situation. 

Is  there  any  obvious  source  of  randomness  in  the  experiment?  Certainly  there 
is,  for  white  noise  is  a  paradigm  of  randomness.  The  statistical  nature  of  the  noise 
means  that  the  calculated  value  of  the  increment  is  merely  the  most  probable  value, 
and  that  a  certain  portion  of  the  time  the  increment  will  depart  from  this  probable 
value by an amount sufficient to affect the discrimination. And in view of the fluctuating
level of the stimulus, it would be surprising indeed if the rigorous experimental
requirements  of  the  quantal  hypothesis  were  fulfilled.  This  situation  demonstrates  the 
practical  difficulty  in  obtaining  the  rectilinear  functions  predicted  by  the  quantal 
hypothesis.  Any  source  of  variability  tends  to  obscure  the  step-wise  results  and  to 
produce  the  S-shaped  normal  probability  integral. 

It should be noted, however, that the shape of the psychometric function is only
one of the implications of the quantal argument. According to the hypothesis, the slope

11 G. A. Miller and W. R. Garner, Effect of random presentation on the psychometric
function: Implications for a quantal theory of discrimination, Am. J. Psychol., 1944, 57, 451-467.
of the psychometric function is determined by the size of the difference limen for all
values of stimulus-intensity. The present data accord with this second prediction.
The standard deviations of the probability integrals which describe the data are
approximately one-third the means (or $0.5\,\Delta I_Q$) for all the thresholds measured for both
subjects.  This  invariance  in  the  slope  of  the  function  is  necessary  but  not  sufficient 
evidence  for  a  neural  quantum,  and  it  makes  possible  the  representation  of  the  results 
in  the  form  shown  in  Fig.  4. 
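
To see how the two descriptions differ, the normal probability integral with mean 1.5 and standard deviation 0.5 (both in units of the quantal increment, as stated above) can be set beside the rectilinear quantal prediction. This is only an illustrative sketch; the function names are assumptions, and no data from Fig. 4 are used.

```python
import math


def normal_ogive(x, mean=1.5, sd=0.5):
    """Cumulative normal probability, with x in units of the quantal increment."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def quantal_prediction(x):
    """Rectilinear prediction: 0 below 1Q, linear from 1Q to 2Q, 1 above 2Q."""
    return min(1.0, max(0.0, x - 1.0))


for x in (0.5, 1.0, 1.5, 2.0, 2.5):
    print(x, round(normal_ogive(x), 3), quantal_prediction(x))
```

Both curves pass through 50 percent at 1.5Q. They differ mainly near 1Q and 2Q, where the ogive is rounded and the quantal function has sharp corners.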

Symbolic  Representation  of  the  Data 

In order to represent the experimental results in symbolic form, the following
symbols will be used:

$b$  numerical constant = 1.333,
$c$  numerical constant = 0.066 = $\Delta I_Q/I$ when $I \gg I_0$,
DL  difference limen (just noticeable difference) expressed in decibels,
$f$  frequency in cycles per second,
$I$  sound intensity (energy flow),
$I_\sim$  sound intensity per cycle,
$I_0$  sound intensity which is just audible in quiet,
$I_m$  sound intensity which is just masked in noise,
$\Delta I_Q$  quantal increment in sound intensity = $0.667\,\Delta I_{50}$,
$\Delta I_{50}$  increment in sound intensity heard 50 percent of the time,
$L$  loudness in sones,
$M$  masking in decibels,
$N_Q$  number of quantal increments above threshold,
$R$  signal-to-noise ratio per cycle at any frequency,
$Z$  effective level of noise at any frequency.

An adequate description of the data in Table I can be developed from the
empirical equation

$$\Delta I_Q = cI + bI_0, \qquad I > I_0 \tag{1}$$

where the quantal increment in the stimulus-energy is assumed to have a fixed and a
variable component. Since $\Delta I_{50}$, the increment which can be heard 50 percent of the
time, equals $1.5\,\Delta I_Q$, we can write

$$DL = 10 \log_{10}(1 + \Delta I_{50}/I) = 10 \log_{10}[1 + 1.5c + 1.5b(I_0/I)]. \tag{2}$$

From (2) it is possible to compute the just noticeable increment in decibels as a function
of sensation level, although we know only the ratio between $I$ and $I_0$ and not their
absolute values. When the computations are carried through, the values indicated by
the solid curve in Fig. 3 are obtained. The fit of this curve to the data is good enough
to justify the use of Eq. (2) to obtain smoothed values of the function.
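
The computation from Eq. (2) is easy to reproduce. The sketch below uses the constants b = 1.333 and c = 0.066 from the symbol list and takes the sensation level as 10 log10 of the ratio of the intensity to the threshold intensity; the function name is an assumption made for illustration.

```python
import math

B = 1.333  # numerical constant b from the symbol list
C = 0.066  # numerical constant c from the symbol list


def difference_limen_db(sensation_level_db):
    """Eq. (2): DL in db as a function of sensation level in db."""
    i0_over_i = 10 ** (-sensation_level_db / 10)
    return 10 * math.log10(1 + 1.5 * C + 1.5 * B * i0_over_i)


for level in (3, 10, 20, 32, 52, 100):
    print(level, round(difference_limen_db(level), 2))
```

At a sensation level of 3 db the equation gives about 3.22 db, against the 3.20 db observed for both listeners in Table I, and at 100 db it settles at about 0.41 db, the constant value cited in the Results.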

It  is  interesting  to  note  that  at  high  intensities  Eq.  (1)  is  equivalent  to  the  well- 
known  "Weber's  Law,"  which  states  that  the  size  of  a  just  noticeable  difference  is 
proportional to the intensity to which it is added. Differential sensitivity characteristically
departs from Weber's Law at low intensities, and Fechner long ago suggested
a modification of the law to the form expressed in Eq. (1).¹² The essential feature

12 H. Helmholtz, Treatise on physiological optics (translated by P. C. Southall from 3rd
German ed., Vol. II. The sensations of vision, 1911), Optical Society of America, 1924, pp. 172-181.
of this equation is the rectilinear relation between $\Delta I$ and $I$; the obvious difficulty is
the explanation of the intercept value $bI_0$ which appears in Eq. (1) as an additive factor.
Fechner supposed that this added term is attributable to intrinsic, interfering stimulation
which cannot be eliminated in the measurement of the difference limen. Body
noises, the spontaneous activity of the auditory nervous system, or the thermal noise
of the air molecules have been suggested as possible sources of this background stimulation,
but proof of these possibilities is still lacking. For the present, therefore, we must
regard  Eq.  (1)  as  a  purely  empirical  equation. 

Relation  to  Masking 

There  is  an  operational  similarity  between  experiments  designed  to  study 
differential  sensitivity  for  intensity  and  experiments  devised  to  measure  auditory 
masking.  This  similarity  is  usually  obscured  by  a  practical  inclination  to  ignore  the 
special  case  where  one  sound  is  masked  by  another  sound  identical  with  the  first. 

Suppose  we  want  to  know  how  much  a  white  noise  masks  a  white  noise.  What 
experimental  procedures  would  we  adopt?  Obviously,  the  judgment  we  would  ask  the 
listener  to  make  is  the  same  judgment  made  in  the  present  experiment.  In  the  one  case, 
however,  we  present  the  data  to  show  the  smallest  detectable  increment,  while  in  the 
other  we  use  the  same  data  to  determine  the  shift  in  threshold  of  the  masked  sound. 
When  the  masked  and  masking  sounds  are  identical,  the  difference  between  masking 
and  sensitivity  to  changes  in  intensity  lies  only  in  the  way  the  story  is  told. 

A  striking  example  of  this  similarity  is  to  be  found  in  the  work  of  Riesz.  In 
order  to  produce  gradual  changes  in  intensity,  Riesz  used  tones  differing  in  frequency 
by  3  cycles  and  instructed  his  listeners  to  report  the  presence  or  absence  of  beats. 
Although  his  results  are  generally  accepted  as  definitive  measures  of  sensitivity  to 
changes  in  the  intensity  of  pure  tones,  it  is  equally  correct  to  interpret  them  as  measures 
of  the  masking  of  one  tone  by  another  tone  differing  in  frequency  by  3  cycles. 

Let  us,  therefore,  reconsider  the  data  of  Table  I.  In  this  table  we  have  presented 
in  decibels  both  the  sensation  level  of  the  noise  and  the  size  of  the  increment  which  can 
be  heard  50  percent  of  the  time.  How  can  these  data  be  transformed  to  correspond 
with  the  definition  of  masking? 

First, consider that we are mixing two noises in order to produce the total
magnitude $I + \Delta I$. Since $I$ is analogous to the intensity of the masking sound, $I + \Delta I$
must equal the intensity of the masking sound plus the intensity of the masked sound,
$I + I_m$. Thus $I_m = \Delta I$, and from the definition of masking $M$ we can write

$$M = 10 \log_{10}(I_m/I_0) = 10 \log_{10}(\Delta I/I_0). \tag{3}$$

Because there appears to be some basic significance to the quantal unit, whereas the
criterion of hearing 50 percent of the increments is arbitrary, we will use the quantal
increment $\Delta I_Q$ in Eq. (3). $\Delta I_Q$ is defined as 0.667 times the value of the increment
which is heard 50 percent of the time.

$$M = 10 \log_{10}(\Delta I_Q/I_0). \tag{3a}$$

Equation  (3a)  tells  us,  then,  that  the  logarithm  of  the  ratio  of  the  quantal  increment 
to  the  absolute  threshold  is  proportional  to  the  masking  of  a  sound  by  an  identical 
sound. 
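
The conversion from a differential threshold in Table I to a quantal increment and then to masking by Eq. (3a) can be sketched as follows. The function names are illustrative assumptions; the factor 1/1.5 is the 0.667 of the text, and the sensation level supplies the ratio of the noise intensity to the threshold intensity.

```python
import math


def quantal_ratio(dl_db):
    """Quantal increment relative to the noise intensity, from a DL in db."""
    ratio_50 = 10 ** (dl_db / 10) - 1   # increment heard 50 percent of the time
    return ratio_50 / 1.5               # quantal increment = 0.667 of that value


def quantal_increment_db(dl_db):
    """Quantal increment expressed in decibels."""
    return 10 * math.log10(1 + quantal_ratio(dl_db))


def masking_db(sensation_level_db, dl_db):
    """Eq. (3a): 10 log10 of the quantal increment over the threshold intensity."""
    return sensation_level_db + 10 * math.log10(quantal_ratio(dl_db))


# Table I entry at a sensation level of 10 db: DL = 1.17 db for both listeners.
print(round(quantal_increment_db(1.17), 2), round(masking_db(10, 1.17), 2))
```

For this entry the sketch gives a quantal increment of about 0.81 db and masking of about 3.14 db, the values listed for 10 db in Table II.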
TABLE II

Masking of White Noise by White Noise. Quantal Increments in Decibels and the Values of Masking Obtained for Two Listeners as a Function of the Sensation Level of the Masking Noise. Computed Values of Masking According to Eq. (4).

Sensation    Quantal increment       Masking obtained        Masking
level        in decibels                                     computed
             GM         SM           GM         SM

  3 db       2.37 db    2.37 db       1.61 db    1.61 db      1.66 db
  5          2.21       1.51          3.26       1.18         1.88
 10          0.81       0.81          3.14       3.14         3.00
 12          0.67       0.61          4.22       3.80         3.76
 15          0.58       0.45          6.58       5.39         5.33
 20          0.33       0.37          9.00       9.54         8.99
 25          0.31       0.37         13.73      14.45        13.44
 32          0.27       0.27         20.06      19.97        20.25
 35          0.27       0.34         23.06      24.10        23.22
 45          0.29       0.30         33.33      33.53        33.20
 52          0.27       0.31         40.06      40.73        40.20
 55          0.27       0.34         42.97      44.10        43.20
 70          0.27       0.32         57.97      58.81        58.20
 82          0.22       0.32         68.32      70.81        70.20
 85          0.22       0.33         72.22      73.91        73.20
100          0.19       0.27         86.43      88.06        88.20

It is now possible to determine the values of $\Delta I_Q$ and $I_0$ from the information
given in Table I, and to substitute these values into Eq. (3a). The results of converting
the differential thresholds into quantal increments and then into masked thresholds are
given in Table II for the two listeners, and are shown in Fig. 5 where masking is plotted
as a function of the sensation level of the masking noise. In addition, Table II contains
values of masking which are computed when Eqs. (1) and (3a) are combined:

$$M = 10 \log_{10}[c(I/I_0) + b]. \tag{4}$$

For intensities 25 db or more above threshold, the masking noise is about 12 db more
intense than the masked noise.
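
The computed column of Table II follows from Eq. (4) with the constants b = 1.333 and c = 0.066. A minimal sketch, with an illustrative function name, is:

```python
import math

B = 1.333  # numerical constant b
C = 0.066  # numerical constant c


def masking_computed_db(sensation_level_db):
    """Eq. (4): masking in db from the sensation level of the masking noise."""
    i_over_i0 = 10 ** (sensation_level_db / 10)
    return 10 * math.log10(C * i_over_i0 + B)


for level in (3, 10, 25, 52, 100):
    m = masking_computed_db(level)
    print(level, round(m, 2), round(level - m, 2))
```

The last printed column is the difference between the level of the masking noise and the masking. It approaches 10 log10(1/c), or about 11.8 db, at the higher levels, which is the "about 12 db" of the text.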

The  obvious  next  step  is  to  ask  whether  these  results  correspond  to  the  functions 
obtained  when  noise  is  used  to  mask  tones  or  human  speech.  Fortunately,  we  are 
able  to  answer  this  question.  Hawkins^  has  measured  the  masking  effects  of  noise  on 
tones  and  speech  with  experimental  conditions  and  equipment  directly  comparable 
with  those  used  here. 

Suppose,  for  purposes  of  comparison,  we  choose  to  mask  a  1000-cycle  tone. 
We  find  over  a  wide  range  of  intensities  that  this  particular  white  noise  just  masks  a 
1000-cycle  tone  which  is  20  db  less  intense.  Since  the  corresponding  value  is  12  db 
when  this  noise  masks  itself,  we  conclude  that,  for  this  specific  noise  spectrum,  8  db 
less  energy  is  needed  for  audibility  when  the  energy  is  concentrated  at  1000  c.p.s. 
than  when  the  energy  is  spread  over  the  entire  spectrum.  In  order  to  compare  the 
forms  of  the  two  masking  functions,  therefore,  we  can  subtract  8  db  from  the  level  of 
the  noise  which  masks  the  1000-cycle  tone. 
Figure 5. Discriminable increments in intensity of white noise plotted in a manner analogous to masking experiments. Solid line represents function obtained by Hawkins for the masking of tones and speech by white noise.

When  we  make  this  correction  of  8  db  in  the  noise  level  for  Hawkins'  data  for  a 
1000-cycle  tone  and  plot  the  masking  of  this  tone  as  a  function  of  the  corrected  noise 
intensity,  we  obtain  the  solid  line  shown  in  Fig.  5.  The  correspondence  between  this 
curve,  taken  from  Hawkins'  data,  and  the  points  obtained  in  the  present  experiment 
is  remarkably  close.  The  function  computed  from  Eq.  (4)  falls  too  close  to  Hawkins' 
function  to  warrant  its  separate  presentation  in  Fig.  5. 

The choice of 1000 c.p.s. is not crucial to this correspondence. As Fletcher
and Munson¹³ have pointed out, a single function is adequate to describe the masking
by noise of pure tones, if the intensity of the noise is corrected by a factor which is a
function of the frequency of the masked tone. This factor is given at any frequency $f$
by the ratio $R$ of the intensity of the masked tone to the intensity per cycle of the noise
at that frequency: $R = I_m/I_\sim$. $R$ is experimentally determined for all frequencies at
intensities well above threshold, on the rectilinear portion of the function shown in
Fig. 5.

For noises with continuous spectra, the masking of a tone of frequency $f$ can be
attributed to the noise in the band of frequencies immediately adjacent to $f$.¹⁴ Consequently,
it is convenient to relate the masking of a tone of frequency $f$ to the intensity
per cycle of the noise at $f$, and to express this intensity in decibels re the threshold of
hearing at any frequency. This procedure gives $10 \log_{10}(I_\sim/I_0)$, which can be regarded
as the sensation level at $f$ of a one-cycle band of noise. The effective level $Z$ of the
noise at that frequency is then defined as

$$Z = 10 \log_{10}(I_\sim/I_0) + 10 \log_{10} R. \tag{5}$$


13 H. Fletcher and W. A. Munson, Relation between loudness and masking, J. acoust.
Soc. Am., 1937, 9, 1-10.

14 H. Fletcher, Auditory patterns, Rev. Mod. Phys., 1940, 12, 47-65.
When the masking of pure tones is plotted as a function of Z, the relation between M
and  Z  is  found  to  be  independent  of  frequency.  A  single  function  expresses  the  relation 
between  M  and  Z  for  all  frequencies. 

When we compare the function relating $M$ to $Z$ with the function obtained in
the present experiment, we find that the sensation level of the noise is equivalent to
$Z$ + 11.8 db. Therefore,

$$I/I_0 = 15.14\,R(I_\sim/I_0).$$

Substituting this expression into Eq. (4) gives

$$M = 10 \log_{10}[R(I_\sim/I_0) + b]. \tag{6}$$

This equation, along with the functions relating $R$ and $I_0$ to frequency, enables us to
compute the masking of pure tones by any random noise of known spectrum. When
$10 \log_{10}(I_\sim/I_0)$ is greater than about 15 db, $b$ is negligible for all frequencies, and the
masking can be computed more simply as $10 \log_{10} R + 10 \log_{10}(I_\sim/I_0)$.
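
Equations (5) and (6) can be combined in a few lines. In the sketch below the per-cycle noise level and the value of 10 log10 R are hypothetical inputs chosen only to show the arithmetic; they are not measurements from this paper, and the function names are likewise illustrative.

```python
import math

B = 1.333  # numerical constant b


def effective_level_db(per_cycle_level_db, r_db):
    """Eq. (5): Z from the per-cycle sensation level and 10 log10 R, both in db."""
    return per_cycle_level_db + r_db


def tone_masking_db(per_cycle_level_db, r_db):
    """Eq. (6): masking of a pure tone by random noise, in db."""
    z = effective_level_db(per_cycle_level_db, r_db)
    return 10 * math.log10(10 ** (z / 10) + B)


# Hypothetical example: per-cycle level of 30 db and 10 log10 R = 18 db.
print(round(tone_masking_db(30, 18), 2))
# Near threshold the constant b is no longer negligible.
print(round(tone_masking_db(0, 0), 2))
```

With these inputs the masking is essentially equal to Z (48 db), which illustrates the simplification noted above for levels where b is negligible.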

Hawkins' results show that the function of Eq. (4) can also be adapted to describe
the masking of human speech by white noise.

Thus  the  correspondence  seems  complete.  When  the  masking  and  the  masked 
sounds  are  identical,  masking  and  sensitivity  to  changes  in  intensity  are  equivalent. 
The  results  obtained  with  identical  masking  and  masked  noises  are  directly  comparable 
to results obtained with different masked sounds. It is reasonable to conclude, therefore,
that the determination of sensitivity to changes in intensity is a special case of the
more  general  masking  experiment. 

It  is  worth  noting  that  this  interpretation  of  masking  is  also  applicable  to  visual 
sensitivity  to  changes  in  the  intensity  of  white  light.  Data  obtained  by  Graham  and 
Bartlett¹⁵ provide an excellent basis for comparison, because of the similarity of their
procedure  to  that  of  the  masking  experiment,  and  because  they  used  homogeneous, 
rod-free,  foveal  areas  of  the  retina.  When  these  data  are  substituted  into  Eq.  (3)  and 
plotted  as  measures  of  visual  masking,  the  result  can  be  described  by  the  same  general 
function  that  we  have  used  to  express  the  auditory  masking  by  noise  of  tones,  speech, 
and  noise. 

Relation  to  Loudness 

When  Fechner  adopted  the  just  noticeable  difference  as  the  unit  for  sensory 
scales, he precipitated a controversy which is still alive today: Are equally-often-noticed
differences subjectively equal? In the case of auditory loudness, the answer seems to be
negative.  Just  noticeable  differences  (j.n.d.'s)  at  high  intensities  are  subjectively  much 
larger  than  j.n.d.'s  at  low  intensities. 

15 C. H. Graham and N. R. Bartlett, The relation of stimulus and intensity in the human
eye: III. The influence of area on foveal intensity discrimination, J. exper. Psychol., 1940, 27,
149-159.

Crozier has used similar visual data to demonstrate that the reciprocal of the just
detectable increment is related to the logarithm of the light intensity by a normal probability
integral. This is deduced on the assumption that sensitivity is determined by the not-already-excited
portion of the total population of potentially excitable neural effects. Crozier's equations
give an excellent description of the auditory data presented here. W. J. Crozier, On the
law for minimal discrimination of intensities. IV. $\Delta I$ as a function of intensity, Proc. Nat. Acad.
Sci., 1940, 26, 382-388.
In order to demonstrate that such is the case for noise as well as for pure tones,
we  need  two  kinds  of  information.  We  need  to  know  the  functions  relating  noise  intensity 
to  the  number  of  distinguishable  steps  above  threshold,  and  to  the  subjective  loudness 
of  the  noise  in  sones.  If  these  two  functions  correspond,  Fechner  was  right  and  j.n.d.'s 
can be used as units on a subjective loudness-scale. If they do not agree, Fechner was
wrong,  and  the  picture  is  more  complex  than  he  imagined. 

TABLE III

Loudness and the Number of Quanta. Sensation-Level of Equally Loud 1000-Cycle Tone as a Function of Sensation-Level of Noise, with Corresponding Loudness in Sones. Data for 12 Listeners. The Last Column Gives the Corresponding Number of Quantal Units Above Threshold.

Sensation       Equally loud 1000 c.p.s.     Loudness in sones            No. of quanta
level of noise  Sensation     Stand.         Mean      ± Stand. dev.      above threshold
                level         dev.

15 db           14.2 db       4.6 db          0.036    0.015 - 0.081       13
30              38.1          6.9             0.83     0.40  - 1.6         58
45              57.9          9.1             4.8      2.3   - 9.7        111
60              74.2          8.2            17.0      9     - 26         163
75              86.3          7.2            37        24    - 47         216
90              97.9          3.1            76        62    - 88         268

The number of differential quanta $N_Q$ corresponding to a given sensation level
of noise is readily obtained by "stepping off" the quantal increments against a scale of
decibels. The procedure consists of finding the number of quantal increments per unit
of intensity and then integrating:

$$N_Q = \int \frac{1}{\Delta I_Q}\, dI. \tag{7}$$

If we substitute for the size of the quantal increment according to Eq. (1),

$$N_Q = \int \frac{1}{cI + bI_0}\, dI = \frac{1}{c} \ln \Delta I_Q + C. \tag{8}$$

When we convert to logarithms to the base 10, insert the values for the constants, and
solve in terms of masking $M$, we find that

$$N_Q = 3.49M + K. \tag{9}$$

We assume that the number of quantal increments is zero when $I = I_0$, and at this point
Eq. (4) indicates that $M$ = 1.46 db. Therefore, $K = -5.1$. Values of $N_Q$ obtained by
Eq. (9) are given in Table III, and plotted in Fig. 7.
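
The step from Eq. (8) to Eq. (9) can be checked numerically. The sketch below assumes that intensity is measured in units of the threshold intensity, so that the integral runs from 1 to the intensity ratio, and compares a midpoint-rule integration with the closed form. The slope ln(10)/(10c) is the 3.49 of Eq. (9); all function names are illustrative.

```python
import math

B = 1.333  # numerical constant b
C = 0.066  # numerical constant c

SLOPE = math.log(10) / (10 * C)  # quanta per db of masking, about 3.49


def masking_db(sensation_level_db):
    """Eq. (4)."""
    return 10 * math.log10(C * 10 ** (sensation_level_db / 10) + B)


def quanta_closed_form(sensation_level_db):
    """Eq. (9), with K fixed so that the count is zero at threshold."""
    k = -SLOPE * masking_db(0)
    return SLOPE * masking_db(sensation_level_db) + k


def quanta_numeric(sensation_level_db, steps=20000):
    """Eq. (8) integrated by the midpoint rule in the variable x = ln(I/I0)."""
    upper = sensation_level_db * math.log(10) / 10
    h = upper / steps
    total = 0.0
    for n in range(steps):
        i = math.exp((n + 0.5) * h)
        total += i / (C * i + B) * h   # dI / (cI + b) with dI = I dx
    return total


for level in (15, 30, 45, 60, 75, 90):
    print(level, round(quanta_closed_form(level), 1), round(quanta_numeric(level), 1))
```

The two routes agree closely, and the results fall within about one unit of the last column of Table III.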

The loudness in sones was determined by requiring listeners to equate the loudness
of the noise with the loudness of a 1000-cycle tone. The two sounds were presented
alternately  to  the  same  ear,  and  the  listener  adjusted  the  intensity  of  the  tone.  Five 
equations  were  made  by  each  of  twelve  listeners  for  the  six  noise-intensities  studied. 
The  result  of  this  experiment — the  level  of  the  1000-cycle  tone  which  sounds  equal  in 
Figure 6. Observed and computed values of the loudness level of white noise. Standard deviations of the values for 12 listeners are indicated by the lengths of the vertical bars.

loudness  to  the  noise — defines  the  loudness  level  of  the  noise.  With  these  data,  which 
are  tabulated  in  Table  III  and  plotted  in  Fig.  6,  the  loudness  in  sones  is  determined 
from  the  loudness-scale  which  has  been  constructed  for  the  1000-cycle  tone.  The 
values in sones from Stevens' loudness-scale¹⁶ are included in Table III. Table III
also  gives  the  standard  deviations  of  the  distributions  of  loudness  levels  obtained  for 
the  12  listeners. 

Loudness  can  also  be  computed.   Fletcher  and  Munson  developed  a  procedure 
for  calculating  loudness  from  the  masking  which  the  sound  produces.    When  this 
Figure 7. Comparison of the number of discriminatory quanta with the loudness of white noise. Just noticeable increments of intensity are not subjectively equal.

16 S. S. Stevens and H. Davis, Hearing, New York: Wiley, 1938, p. 118.

procedure is applied to Hawkins' data for the masking of pure tones by noise, we get
the computed values shown in Fig. 6. The agreement between computed and experimental
results is quite satisfactory.

We  are  now  equipped  to  present  the  two  functions  shown  in  Fig.  7.   The  solid 
curve  shows  the  number  of  quantal  units  as  a  function  of  sensation  level.  The  dashed 
Figure 8. Relation between loudness and masking for white noise.

curve  shows  the  loudness  in  sones.  The  discrepancy  between  these  two  curves  affirms 
the error of Fechner's assumption. Loudness and the number of just noticeable differences
are not linearly related.

When, as in the present case, two variables are both related to a third, it is
possible to determine their relation to each other. Stevens¹⁷ has used Riesz's data for

17 S. S. Stevens, A scale for the measurement of a psychological magnitude: loudness,
Psychol. Rev., 1936, 43, 405-416.
pure tones to arrive at the empirical equation $L = kN^{2.2}$, where $L$ is the loudness of
the tone in sones, $k$ is the size in sones of the first step, and $N$ is the number of distinguishable
steps. When we parallel Stevens' computation with the data for noise
presented  in  Table  III,  we  find  that  L  =  kN^  describes  the  relation  rather  well  over 
most  of  the  range.  It  is  interesting  that  both  turn  out  to  be  power  functions,  but  why 
the  exponent  should  be  different  for  noise  and  tones  is  not  apparent. 

There  is  an  alternative  way  to  state  the  relation  between  differential  sensitivity 
and  loudness.  In  the  preceding  section  we  developed  the  notion  that  sensitivity  to 
changes  in  intensity  is  a  special  case  of  masking,  and  we  computed  the  masking  of  the 
noise  on  itself.  Let  us  now  examine  the  relation  between  masking,  so  defined,  and  the 
subjective  loudness.  In  Fig.  5  and  Table  II  masking  is  related  to  sensation  level;  in 
Fig.  7  and  Table  III  loudness  is  related  to  sensation  level.  The  relation  of  masking  to 
loudness  is  obtained  by  combining  these  two  functions.  In  Fig.  8  it  can  be  seen  that 
the expression $L = KM^{3}$ fits the data rather well. The loudness of a white noise
increases in proportion to the third power of the masking produced by the noise on itself,
i.e.,  the  third  power  of  the  logarithm  of  the  quantal  increment  in  intensity.  In  whatever 
form  we  cast  the  empirical  equation,  however,  it  is  obvious  that  faint  j.n.d.'s  are  smaller 
than  loud  j.n.d.'s  and  that  j.n.d.'s  are  not  equal  units  along  a  scale  of  loudness. 
Correction

In  1963  D.  H.  Raab,  E.  Osman,  and  E.  Rich  noticed  an  error  in  Eq.  (3),  which 
is  written  as  if  the  masked  and  masking  noises  had  been  generated  independently.  In 
fact,  however,  the  two  noises  were  perfectly  correlated  (cf.  Fig.  1),  so  their  sound 
pressures  added  in  phase;  their  combined  power  was  the  square  of  their  summed 
pressures,  not  the  sum  of  their  squared  pressures.  When  the  amount  of  masking  is 
recomputed from Table I using $M = 20 \log_{10}(\Delta P / P_0)$, then for intensities 25 db or
more  above  threshold,  the  masking  noise  is  about  25  db  more  intense  than  the  masked 
noise  (not  12  db  as  stated  on  page  128).  At  sensation  levels  below  about  25  db, 
therefore,  there  was  facilitation  (negative  masking)  instead  of  masking;  listeners 
were  able  to  hear  in-phase  increments  which  would  have  been  inaudible  if  presented 
alone  in  the  absence  of  the  "masking"  noise.  This  fact  was  verified  directly  by  Raab, 
Osman,  and  Rich;  a  similar  effect  for  sinusoids  has  been  reported  by  S.  M.  Pfafflin 
and M. V. Mathews, Energy-detection model for monaural auditory detection, J.
acoust.  Soc.  Am.,  1962,  34,  1842-1853. 
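
The distinction drawn in this correction, between pressures that add in phase and powers that add independently, can be illustrated numerically. The following is only a sketch: the function name and the unit pressures are assumptions chosen for illustration, not values from Table I.

```python
import math

def level_change_db(p_masking, p_increment, correlated):
    """Change in level (db) when an increment is added to a masking noise.

    Pressures are RMS values in arbitrary but identical units (hypothetical inputs).
    """
    if correlated:
        # Perfectly correlated noises: pressures sum, and the sum is squared.
        combined_power = (p_masking + p_increment) ** 2
    else:
        # Independently generated noises: the squared pressures sum.
        combined_power = p_masking ** 2 + p_increment ** 2
    return 10 * math.log10(combined_power / p_masking ** 2)

in_phase = level_change_db(1.0, 1.0, correlated=True)       # about 6.02 db
independent = level_change_db(1.0, 1.0, correlated=False)   # about 3.01 db
print(round(in_phase, 2), round(independent, 2))
```

For two equal pressures the in-phase case raises the level by about 6 db and the independent case by about 3 db, which is why an equation written for independent noises misstates the increment when the noises are correlated.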

When  this  correction  is  made,  of  course,  the  relation  shown  in  Fig.  8  no  longer 
obtains. 
My problem is that I have been persecuted
by an integer. For seven years
this  number  has  followed  me  around,  has 
intruded  in  my  most  private  data,  and 
has  assaulted  me  from  the  pages  of  our 
most  public  journals.  This  number  as- 
sumes a  variety  of  disguises,  being  some- 
times a  little  larger  and  sometimes  a 
little  smaller  than  usual,  but  never 
changing  so  much  as  to  be  unrecogniz- 
able. The  persistence  with  which  this 
number  plagues  me  is  far  more  than 
a  random  accident.  There  is,  to  quote 
a  famous  senator,  a  design  behind  it, 
some  pattern  governing  its  appearances. 
Either  there  really  is  something  unusual 
about  the  number  or  else  I  am  suffering 
from  delusions  of  persecution. 

I  shall  begin  my  case  history  by  tell- 
ing you  about  some  experiments  that 
tested  how  accurately  people  can  assign 
numbers  to  the  magnitudes  of  various 
aspects  of  a  stimulus.  In  the  tradi- 
tional language  of  psychology  these 
would  be  called  experiments  in  absolute 

1  This  paper  was  first  read  as  an  Invited 
Address  before  the  Eastern  Psychological  As- 
sociation in  Philadelphia  on  April  15,  1955. 
Preparation  of  the  paper  was  supported  by 
the Harvard Psycho-Acoustic Laboratory under
Contract N5ori-76 between Harvard University
and the Office of Naval Research, U. S.
Navy  (Project  NR142-201,  Report  PNR-174). 
Reproduction  for  any  purpose  of  the  U.  S. 
Government  is  permitted. 


judgment.  Historical  accident,  how- 
ever, has  decreed  that  they  should  have 
another  name.  We  now  call  them  ex- 
periments on  the  capacity  of  people  to 
transmit  information.  Since  these  ex- 
periments would  not  have  been  done 
without  the  appearance  of  information 
theory  on  the  psychological  scene,  and 
since  the  results  are  analyzed  in  terms 
of  the  concepts  of  information  theory, 
I  shall  have  to  preface  my  discussion 
with  a  few  remarks  about  this  theory. 

Information  Measurement 

The  "amount  of  information"  is  ex- 
actly the  same  concept  that  we  have 
talked  about  for  years  under  the  name 
of  "variance."  The  equations  are  dif- 
ferent, but  if  we  hold  tight  to  the  idea 
that  anything  that  increases  the  vari- 
ance also  increases  the  amount  of  infor- 
mation we  cannot  go  far  astray. 

The  advantages  of  this  new  way 
of  talking  about  variance  are  simple 
enough.  Variance  is  always  stated  in 
terms  of  the  unit  of  measurement — 
inches,  pounds,  volts,  etc. — whereas  the 
amount  of  information  is  a  dimension- 
less  quantity.  Since  the  information  in 
a  discrete  statistical  distribution  does 
not  depend  upon  the  unit  of  measure- 
ment, we  can  extend  the  concept  to 
situations  where  we  have  no  metric  and 
we  would  not  ordinarily  think  of  using 
the variance. And it also enables us to
compare  results  obtained  in  quite  dif- 
ferent experimental  situations  where  it 
would  be  meaningless  to  compare  vari- 
ances based  on  different  metrics.  So 
there  are  some  good  reasons  for  adopt- 
ing the  newer  concept. 

The  similarity  of  variance  and  amount 
of  information  might  be  explained  this 
way:  When  we  have  a  large  variance, 
we  are  very  ignorant  about  what  is  go- 
ing to  happen.  If  we  are  very  ignorant, 
then  when  we  make  the  observation  it 
gives  us  a  lot  of  information.  On  the 
other  hand,  if  the  variance  is  very  small, 
we  know  in  advance  how  our  observa- 
tion must  come  out,  so  we  get  little  in- 
formation from  making  the  observation. 

If  you  will  now  imagine  a  communi- 
cation system,  you  will  realize  that 
there  is  a  great  deal  of  variability  about 
what  goes  into  the  system  and  also  a 
great  deal  of  variability  about  what 
comes  out.  The  input  and  the  output 
can  therefore  be  described  in  terms  of 
their  variance  (or  their  information). 
If  it  is  a  good  communication  system, 
however,  there  must  be  some  system- 
atic relation  between  what  goes  in  and 
what  comes  out.  That  is  to  say,  the 
output  will  depend  upon  the  input,  or 
will  be  correlated  with  the  input.  If  we 
measure  this  correlation,  then  we  can 
say  how  much  of  the  output  variance  is 
attributable  to  the  input  and  how  much 
is  due  to  random  fluctuations  or  "noise" 
introduced  by  the  system  during  trans- 
mission. So  we  see  that  the  measure 
of  transmitted  information  is  simply  a 
measure  of  the  input-output  correlation. 

There  are  two  simple  rules  to  follow. 
Whenever  I  refer  to  "amount  of  in- 
formation," you  will  understand  "vari- 
ance." And  whenever  I  refer  to  "amount 
of  transmitted  information,"  you  will 
understand  "covariance"  or  "correla- 
tion." 

The situation can be described graphically
by two partially overlapping circles.
Then the left circle can be taken
to represent the variance of the input,
the right circle the variance of the output,
and the overlap the covariance of
input and output. I shall speak of the
left circle as the amount of input information,
the right circle as the amount
of output information, and the overlap
as the amount of transmitted information.

In  the  experiments  on  absolute  judg- 
ment, the  observer  is  considered  to  be 
a  communication  channel.  Then  the 
left  circle  would  represent  the  amount 
of  information  in  the  stimuli,  the  right 
circle  the  amount  of  information  in  his 
responses,  and  the  overlap  the  stimulus- 
response  correlation  as  measured  by  the 
amount  of  transmitted  information.  The 
experimental  problem  is  to  increase  the 
amount  of  input  information  and  to 
measure  the  amount  of  transmitted  in- 
formation. If  the  observer's  absolute 
judgments  are  quite  accurate,  then 
nearly  all  of  the  input  information  will 
be  transmitted  and  will  be  recoverable 
from  his  responses.  If  he  makes  errors, 
then  the  transmitted  information  may 
be  considerably  less  than  the  input.  We 
expect  that,  as  we  increase  the  amount 
of  input  information,  the  observer  will 
begin  to  make  more  and  more  errors; 
we  can  test  the  limits  of  accuracy  of  his 
absolute  judgments.  If  the  human  ob- 
server is  a  reasonable  kind  of  communi- 
cation system,  then  when  we  increase 
the  amount  of  input  information  the 
transmitted  information  will  increase  at 
first  and  will  eventually  level  off  at  some 
asymptotic  value.  This  asymptotic  value 
we  take  to  be  the  channel  capacity  of 
the  observer:  it  represents  the  greatest 
amount  of  information  that  he  can  give 
us  about  the  stimulus  on  the  basis  of 
an  absolute  judgment.  The  channel  ca- 
pacity is  the  upper  limit  on  the  extent 
to  which  the  observer  can  match  his  re- 
sponses to  the  stimuli  we  give  him. 
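
The picture of two overlapping circles can be made concrete with a small computation. The sketch below assumes the usual information-theoretic definition, in which transmitted information is the stimulus information plus the response information minus their joint information; the function names and the two count tables are hypothetical and are not data from any experiment cited here.

```python
import math

def entropy(probs):
    """Shannon information in bits of a list of probabilities (zeros are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def transmitted_information(counts):
    """T = H(stimulus) + H(response) - H(joint), in bits.

    counts[i][j] is the number of trials on which stimulus i drew response j.
    """
    total = sum(sum(row) for row in counts)
    joint = [c / total for row in counts for c in row]
    stimulus = [sum(row) / total for row in counts]          # left circle
    response = [sum(col) / total for col in zip(*counts)]    # right circle
    return entropy(stimulus) + entropy(response) - entropy(joint)  # overlap

# Hypothetical count tables, made up for illustration only.
perfect = [[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0], [0, 0, 0, 10]]
guessing = [[5, 5], [5, 5]]

t_perfect = transmitted_information(perfect)   # all 2 bits of input are transmitted
t_guess = transmitted_information(guessing)    # nothing is transmitted
print(t_perfect, t_guess)
```

With error-free judgments of four equally likely stimuli the overlap equals the full 2 bits of input; with responses unrelated to the stimuli the overlap is zero.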

Now  just  a  brief  word  about  the  bit 
and we can begin to look at some data.
One  bit  of  information  is  the  amount  of 
information  that  we  need  to  make  a 
decision  between  two  equally  likely  al- 
ternatives. If  we  must  decide  whether 
a  man  is  less  than  six  feet  tall  or  more 
than  six  feet  tall  and  if  we  know  that 
the  chances  are  50-50,  then  we  need 
one  bit  of  information.  Notice  that 
this  unit  of  information  does  not  refer 
in  any  way  to  the  unit  of  length  that 
we  use — feet,  inches,  centimeters,  etc. 
However  you  measure  the  man's  height, 
we  still  need  just  one  bit  of  information. 

Two bits of information enable us to
decide among four equally likely alternatives.
Three bits of information enable
us to decide among eight equally
likely alternatives. Four bits of information
decide among 16 alternatives,
five among 32, and so on. That is to
say, if there are 32 equally likely alternatives,
we must make five successive
binary decisions, worth one bit each, before
we know which alternative is correct.
So the general rule is simple:
every time the number of alternatives
is increased by a factor of two, one bit
of information is added.
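
The doubling rule is simply the base-2 logarithm of the number of equally likely alternatives. A minimal sketch follows; the function name is an assumption introduced here for illustration.

```python
import math

def bits_for_alternatives(n):
    """Bits needed to decide among n equally likely alternatives."""
    return math.log2(n)

# Each doubling of the number of alternatives adds exactly one bit.
for n in (2, 4, 8, 16, 32):
    print(n, "alternatives:", bits_for_alternatives(n), "bits")
```

The same function handles numbers that are not powers of two; for instance, 14 equally likely alternatives come to a little over 3.8 bits.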

There  are  two  ways  we  might  in- 
crease the  amount  of  input  information. 
We  could  increase  the  rate  at  which  we 
give  information  to  the  observer,  so  that 
the  amount  of  information  per  unit  time 
would  increase.  Or  we  could  ignore  the 
time  variable  completely  and  increase 
the  amount  of  input  information  by 
increasing  the  number  of  alternative 
stimuli.  In  the  absolute  judgment  ex- 
periment we  are  interested  in  the  second 
alternative.  We  give  the  observer  as 
much  time  as  he  wants  to  make  his  re- 
sponse; we  simply  increase  the  number 
of  alternative  stimuli  among  which  he 
must  discriminate  and  look  to  see  where 
confusions  begin  to  occur.  Confusions 
will  appear  near  the  point  that  we  are 
calling  his  "channel  capacity." 


Absolute Judgments of Unidimensional Stimuli

Now  let  us  consider  what  happens 
when  we  make  absolute  judgments  of 
tones.  Pollack  (17)  asked  listeners  to 
identify  tones  by  assigning  numerals  to 
them.  The  tones  were  different  with  re- 
spect to  frequency,  and  covered  the 
range  from  100  to  8000  cps  in  equal 
logarithmic  steps.  A  tone  was  sounded 
and  the  listener  responded  by  giving  a 
numeral.  After  the  listener  had  made 
his  response  he  was  told  the  correct 
identification  of  the  tone. 

When  only  two  or  three  tones  were 
used  the  listeners  never  confused  them. 
With  four  different  tones  confusions 
were  quite  rare,  but  with  five  or  more 
tones  confusions  were  frequent.  With 
fourteen  different  tones  the  listeners 
made  many  mistakes. 

These data are plotted in Fig. 1.
Along the bottom is the amount of input
information in bits per stimulus.
As the number of alternative tones was
increased from 2 to 14, the input information
increased from 1 to 3.8 bits. On
the ordinate is plotted the amount of
transmitted information. The amount
of transmitted information behaves in
much the way we would expect a communication
channel to behave; the transmitted
information increases linearly up
to about 2 bits and then bends off toward
an asymptote at about 2.5 bits.
This value, 2.5 bits, therefore, is what
we are calling the channel capacity of
the listener for absolute judgments of
pitch.

Fig. 1. Data from Pollack (17, 18) on the
amount of information that is transmitted by
listeners who make absolute judgments of
auditory pitch. As the amount of input information
is increased by increasing from 2
to 14 the number of different pitches to be
judged, the amount of transmitted information
approaches as its upper limit a channel
capacity of about 2.5 bits per judgment.

So  now  we  have  the  number  2.5 
bits.  What  does  it  mean?  First,  note 
that  2.5  bits  corresponds  to  about  six 
equally  likely  alternatives.  The  result 
means  that  we  cannot  pick  more  than 
six  different  pitches  that  the  listener  will 
never  confuse.  Or,  stated  slightly  dif- 
ferently, no  matter  how  many  alterna- 
tive tones  we  ask  him  to  judge,  the  best 
we  can  expect  him  to  do  is  to  assign 
them  to  about  six  different  classes  with- 
out error.  Or,  again,  if  we  know  that 
there  were  N  alternative  stimuli,  then 
his  judgment  enables  us  to  narrow  down 
the  particular  stimulus  to  one  out  of 
N/6. 
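
The step from bits to "about six" alternatives is the inverse of the logarithm: a capacity of b bits corresponds to 2 raised to the power b equally likely alternatives. The sketch below uses the 2.5-bit figure from the text; the function name and the choice of 14 stimuli as the worked case are assumptions for illustration.

```python
def alternatives_for_bits(bits):
    """Number of equally likely alternatives that a given number of bits can separate."""
    return 2 ** bits

capacity_bits = 2.5
classes = alternatives_for_bits(capacity_bits)   # about 5.66, i.e. roughly six classes

# With N alternative stimuli, one judgment narrows the choice to about N / classes.
n_stimuli = 14
remaining = n_stimuli / classes                  # about 2.5 candidates remain
print(round(classes, 2), round(remaining, 2))
```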

Most  people  are  surprised  that  the 
number  is  as  small  as  six.  Of  course, 
there  is  evidence  that  a  musically  so- 
phisticated person  with  absolute  pitch 
can identify accurately any one of 50
or 60 different pitches. Fortunately, I
do  not  have  time  to  discuss  these  re- 
markable exceptions.  I  say  it  is  for- 
tunate because  I  do  not  know  how  to 
explain  their  superior  performance.  So 
I  shall  stick  to  the  more  pedestrian  fact 
that  most  of  us  can  identify  about  one 
out  of  only  five  or  six  pitches  before  we 
begin  to  get  confused. 

It  is  interesting  to  consider  that  psy- 
chologists have  been  using  seven-point 
rating  scales  for  a  long  time,  on  the 
intuitive  basis  that  trying  to  rate  into 
finer  categories  does  not  really  add  much 
to  the  usefulness  of  the  ratings.  Pol- 
lack's results  indicate  that,  at  least  for 
pitches,  this  intuition  is  fairly  sound. 
Fig. 2. Data from Garner (7) on the channel
capacity for absolute judgments of auditory
loudness.

Next  you  can  ask  how  reproducible 
this  result  is.  Does  it  depend  on  the 
spacing  of  the  tones  or  the  various  con- 
ditions of  judgment?  Pollack  varied 
these  conditions  in  a  number  of  ways. 
The  range  of  frequencies  can  be  changed 
by  a  factor  of  about  20  without  chang- 
ing the  amount  of  information  trans- 
mitted more  than  a  small  percentage. 
Different  groupings  of  the  pitches  de- 
creased the  transmission,  but  the  loss 
was  small.  For  example,  if  you  can 
discriminate  five  high-pitched  tones  in 
one  series  and  five  low-pitched  tones  in 
another  series,  it  is  reasonable  to  ex- 
pect that  you  could  combine  all  ten  into 
a  single  series  and  still  tell  them  all 
apart  without  error.  When  you  try  it, 
however,  it  does  not  work.  The  chan- 
nel capacity  for  pitch  seems  to  be  about 
six  and  that  is  the  best  you  can  do. 

While we are on tones, let us look
next at Garner's (7) work on loudness.
Garner's data for loudness are summarized
in Fig. 2. Garner went to some
trouble to get the best possible spacing
of his tones over the intensity range
from 15 to 110 db. He used 4, 5, 6, 7,
10, and 20 different stimulus intensities.
The results shown in Fig. 2 take into
account the differences among subjects
and the sequential influence of the immediately
preceding judgment. Again
we find that there seems to be a limit.
The channel capacity for absolute judgments
of loudness is 2.3 bits, or about
five perfectly discriminable alternatives.

Fig. 3. Data from Beebe-Center, Rogers,
and O'Connell (1) on the channel capacity for
absolute judgments of saltiness.

Since  these  two  studies  were  done  in 
different  laboratories  with  slightly  dif- 
ferent techniques  and  methods  of  analy- 
sis, we  are  not  in  a  good  position  to 
argue  whether  five  loudnesses  is  signifi- 
cantly different  from  six  pitches.  Prob- 
ably the  difference  is  in  the  right  direc- 
tion, and  absolute  judgments  of  pitch 
are  slightly  more  accurate  than  absolute 
judgments  of  loudness.  The  important 
point,  however,  is  that  the  two  answers 
are  of  the  same  order  of  magnitude. 

The  experiment  has  also  been  done 
for  taste  intensities.  In  Fig.  3  are  the 
results  obtained  by  Beebe-Center,  Rog- 
ers, and  O'Connell  (1)  for  absolute 
judgments  of  the  concentration  of  salt 
solutions.  The  concentrations  ranged 
from  0.3  to  34.7  gm.  NaCl  per  100 
cc.  tap  water  in  equal  subjective  steps. 
They  used  3,  5,  9,  and  17  different  con- 
centrations. The  channel  capacity  is 
1.9  bits,  which  is  about  four  distinct 
concentrations.  Thus  taste  intensities 
seem  a  little  less  distinctive  than  audi- 
tory stimuli,  but  again  the  order  of 
magnitude  is  not  far  off. 

On the other hand, the channel capacity
for judgments of visual position
seems to be significantly larger. Hake
and Garner (8) asked observers to interpolate
visually between two scale
markers. Their results are shown in
Fig. 4. They did the experiment in
two ways. In one version they let the
observer use any number between zero
and 100 to describe the position, although
they presented stimuli at only
5, 10, 20, or 50 different positions. The
results with this unlimited response
technique are shown by the filled circles
on the graph. In the other version the
observers were limited in their responses
to reporting just those stimulus
values that were possible. That is
to say, in the second version the number
of different responses that the observer
could make was exactly the same
as the number of different stimuli that
the experimenter might present. The
results with this limited response technique
are shown by the open circles on
the graph. The two functions are so
similar that it seems fair to conclude
that the number of responses available
to the observer had nothing to do with
the channel capacity of 3.25 bits.

The Hake-Garner experiment has been
repeated by Coonan and Klemmer. Although
they have not yet published
their results, they have given me permission
to say that they obtained channel
capacities ranging from 3.2 bits for
very short exposures of the pointer position
to 3.9 bits for longer exposures.
These values are slightly higher than
Hake and Garner's, so we must conclude
that there are between 10 and 15
distinct positions along a linear interval.
This is the largest channel capacity
that has been measured for any
unidimensional variable.

At  the  present  time  these  four  experi- 
ments on  absolute  judgments  of  simple, 
unidimensional  stimuli  are  all  that  have 
appeared  in  the  psychological  journals. 
However,  a  great  deal  of  work  on  other 
stimulus  variables  has  not  yet  appeared 
in  the  journals.  For  example,  Eriksen 
and  Hake  (6)  have  found  that  the 
channel  capacity  for  judging  the  sizes 
of  squares  is  2.1  bits,  or  about  five 
categories,  under  a  wide  range  of  ex- 
perimental conditions.  In  a  separate 
experiment  Eriksen  (5)  found  2.8  bits 
for size, 3.1 bits for hue, and 2.3 bits
for  brightness.  Geldard  has  measured 
the  channel  capacity  for  the  skin  by 
placing  vibrators  on  the  chest  region. 
A  good  observer  can  identify  about  four 
intensities,  about  five  durations,  and 
about  seven  locations. 

One of the most active groups in this
area has been the Air Force Operational
Applications Laboratory. Pollack has
been kind enough to furnish me with
the results of their measurements for
several aspects of visual displays. They
made measurements for area and for
the curvature, length, and direction of
lines. In one set of experiments they
used a very short exposure of the stimulus
(1/40 second) and then they repeated
the measurements with a 5-second
exposure. For area they got
2.6 bits with the short exposure and
2.7 bits with the long exposure. For
the length of a line they got about 2.6
bits with the short exposure and about
3.0 bits with the long exposure. Direction,
or angle of inclination, gave 2.8
bits for the short exposure and 3.3 bits
for the long exposure. Curvature was
apparently harder to judge. When the
length of the arc was constant, the result
at the short exposure duration was
2.2 bits, but when the length of the
chord was constant, the result was only
1.6 bits. This last value is the lowest
that anyone has measured to date. I
should add, however, that these values
are apt to be slightly too low because
the data from all subjects were pooled
before the transmitted information was
computed.

Now  let  us  see  where  we  are.  First, 
the  channel  capacity  does  seem  to  be  a 
valid  notion  for  describing  human  ob- 
servers. Second,  the  channel  capacities 
measured  for  these  unidimensional  vari- 
ables range  from  1.6  bits  for  curvature 
to  3.9  bits  for  positions  in  an  interval. 
Although  there  is  no  question  that  the 
differences  among  the  variables  are  real 
and  meaningful,  the  more  impressive 
fact  to  me  is  their  considerable  simi- 
larity. If  I  take  the  best  estimates  I 
can  get  of  the  channel  capacities  for  all 
the  stimulus  variables  I  have  mentioned, 
the  mean  is  2.6  bits  and  the  standard 
deviation  is  only  0.6  bit.  In  terms  of 
distinguishable  alternatives,  this  mean 
corresponds  to  about  6.5  categories,  one 
standard  deviation  includes  from  4  to 
10  categories,  and  the  total  range  is 
from  3  to  15  categories.  Considering 
the  wide  variety  of  different  variables 
that  have  been  studied,  I  find  this  to 
be  a  remarkably  narrow  range. 

There  seems  to  be  some  limitation 
built  into  us  either  by  learning  or  by 
the  design  of  our  nervous  systems,  a 
limit  that  keeps  our  channel  capacities 
in  this  general  range.  On  the  basis  of 
the  present  evidence  it  seems  safe  to 
say  that  we  possess  a  finite  and  rather 
small  capacity  for  making  such  unidi- 
mensional judgments  and  that  this  ca- 
pacity does  not  vary  a  great  deal  from 
one  simple  sensory  attribute  to  another. 
Absolute Judgments of Multidimensional Stimuli

You  may  have  noticed  that  I  have 
been  careful  to  say  that  this  magical 
number  seven  applies  to  one-dimensional 
judgments.  Everyday  experience  teaches 
us  that  we  can  identify  accurately  any 
one  of  several  hundred  faces,  any  one 
of  several  thousand  words,  any  one  of 
several  thousand  objects,  etc.  The  story 
certainly  would  not  be  complete  if  we 
stopped  at  this  point.  We  must  have 
some  understanding  of  why  the  one- 
dimensional  variables  we  judge  in  the 
laboratory  give  results  so  far  out  of 
line  with  what  we  do  constantly  in  our 
behavior  outside  the  laboratory.  A  pos- 
sible explanation  lies  in  the  number  of 
independently  variable  attributes  of  the 
stimuli  that  are  being  judged.  Objects, 
faces,  words,  and  the  like  differ  from 
one  another  in  many  ways,  whereas  the 
simple  stimuli  we  have  considered  thus 
far  differ  from  one  another  in  only  one 
respect. 

Fortunately, there are a few data on
what happens when we make absolute
judgments of stimuli that differ from
one another in several ways. Let us
look first at the results Klemmer and
Frick (13) have reported for the absolute
judgment of the position of a dot
in a square. In Fig. 5 we see their results.
Now the channel capacity seems
to have increased to 4.6 bits, which
means that people can identify accurately
any one of 24 positions in the
square.

Fig. 5. Data from Klemmer and Frick (13)
on the channel capacity for absolute judgments
of the position of a dot in a square.

The  position  of  a  dot  in  a  square  is 
clearly  a  two-dimensional  proposition. 
Both  its  horizontal  and  its  vertical  po- 
sition must  be  identified.  Thus  it  seems 
natural  to  compare  the  4.6-bit  capacity 
for  a  square  with  the  3.25-bit  capacity 
for  the  position  of  a  point  in  an  inter- 
val. The  point  in  the  square  requires 
two  judgments  of  the  interval  type.  If 
we  have  a  capacity  of  3.25  bits  for  esti- 
mating intervals  and  we  do  this  twice, 
we  should  get  6.5  bits  as  our  capacity 
for  locating  points  in  a  square.  Adding 
the  second  independent  dimension  gives 
us  an  increase  from  3.25  to  4.6,  but  it 
falls  short  of  the  perfect  addition  that 
would  give  6.5  bits. 

Another  example  is  provided  by  Beebe- 
Center,  Rogers,  and  O'Connell.  When 
they  asked  people  to  identify  both  the 
saltiness  and  the  sweetness  of  solutions 
containing  various  concentrations  of  salt 
and  sucrose,  they  found  that  the  chan- 
nel capacity  was  2.3  bits.  Since  the  ca- 
pacity for  salt  alone  was  1.9,  we  might 
expect  about  3.8  bits  if  the  two  aspects 
of  the  compound  stimuli  were  judged 
independently.  As  with  spatial  loca- 
tions, the  second  dimension  adds  a  little 
to  the  capacity  but  not  as  much  as  it 
conceivably  might. 

A  third  example  is  provided  by  Pol- 
lack (18),  who  asked  listeners  to  judge 
both  the  loudness  and  the  pitch  of  pure 
tones.  Since  pitch  gives  2.5  bits  and 
loudness  gives  2.3  bits,  we  might  hope 
to  get  as  much  as  4.8  bits  for  pitch  and 
loudness  together.  Pollack  obtained  3.1 
bits,  which  again  indicates  that  the 
second  dimension  augments  the  channel 
capacity  but  not  so  much  as  it  might. 
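
The three comparisons above follow one pattern: add the one-dimensional capacities, then set the sum against the capacity actually observed for the compound stimulus. The sketch below merely tabulates the figures quoted in the text. The dictionary and function names are hypothetical, and the salt-and-sweetness entry assumes, as the text's figure of 3.8 bits implies, that sweetness alone would also give about 1.9 bits.

```python
# (bits expected under perfect addition, bits observed), values as quoted in the text.
examples = {
    "dot position in a square": (3.25 + 3.25, 4.6),
    "saltiness and sweetness": (1.9 + 1.9, 2.3),   # assumes sweetness alone is also 1.9 bits
    "pitch and loudness": (2.5 + 2.3, 3.1),
}

def shortfall(expected_bits, observed_bits):
    """Bits by which the observed capacity falls short of perfect addition."""
    return expected_bits - observed_bits

for name, (expected, observed) in examples.items():
    print(f"{name}: expected {expected:.2f}, observed {observed:.2f}, "
          f"shortfall {shortfall(expected, observed):.2f} bits")
```

In each case the observed capacity exceeds that of either dimension taken alone, yet falls short of the sum by one and a half to two bits.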

A fourth example can be drawn from
the work of Halsey and Chapanis (9)
on confusions among colors of equal
luminance. Although they did not analyze
their results in informational terms,
they estimate that there are about 11 to
15 identifiable colors, or, in our terms,
about 3.6 bits. Since these colors varied
in both hue and saturation, it is probably
correct to regard this as a two-dimensional
judgment. If we compare
this with Eriksen's 3.1 bits for hue
(which is a questionable comparison to
draw), we again have something less
than perfect addition when a second
dimension is added.

It is still a long way, however, from
these two-dimensional examples to the
multidimensional stimuli provided by
faces, words, etc. To fill this gap we
have only one experiment, an auditory
study done by Pollack and Ficks (19).
They managed to get six different acoustic
variables that they could change:
frequency, intensity, rate of interruption,
on-time fraction, total duration,
and spatial location. Each one of these
six variables could assume any one of
five different values, so altogether there
were $5^{6}$, or 15,625 different tones that
they could present. The listeners made
a separate rating for each one of these
six dimensions. Under these conditions
the transmitted information was 7.2 bits,
which corresponds to about 150 different
categories that could be absolutely
identified without error. Now we are
beginning to get up into the range that
ordinary experience would lead us to
expect.
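
The arithmetic of the six-dimensional study can be checked in a few lines. This is only a sketch using the figures quoted in the text; the variable names are assumptions, and the input information is computed on the further assumption that all tones were equally likely.

```python
import math

values_per_dimension = 5
dimensions = 6

n_tones = values_per_dimension ** dimensions   # 15625 different tones
input_bits = math.log2(n_tones)                # about 13.9 bits, if all tones are equally likely

transmitted_bits = 7.2                         # figure quoted in the text
identifiable = 2 ** transmitted_bits           # about 147, i.e. roughly 150 categories

print(n_tones, round(input_bits, 2), round(identifiable))
```

On these assumptions roughly half of the input information is transmitted, yet that half still separates about 150 categories.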

Suppose  that  we  plot  these  data, 
fragmentary  as  they  are,  and  make  a 
guess  about  how  the  channel  capacity 
changes  with  the  dimensionality  of  the 
stimuli.  The  result  is  given  in  Fig.  6. 
In  a  moment  of  considerable  daring  I 
sketched  the  dotted  line  to  indicate 
roughly  the  trend  that  the  data  seemed 
to  be  taking. 

Clearly, the addition of independently
variable attributes to the stimulus increases
the channel capacity, but at a
decreasing rate. It is interesting to
note that the channel capacity is increased
even when the several variables
are not independent. Eriksen (5) reports
that, when size, brightness, and
hue all vary together in perfect correlation,
the transmitted information is 4.1
bits as compared with an average of
about 2.7 bits when these attributes are
varied one at a time. By confounding
three attributes, Eriksen increased the
dimensionality of the input without increasing
the amount of input information;
the result was an increase in channel
capacity of about the amount that
the dotted function in Fig. 6 would lead
us to expect.

Fig. 6. The general form of the relation between
channel capacity and the number of independently
variable attributes of the stimuli.

The  point  seems  to  be  that,  as  we 
add  more  variables  to  the  display,  we 
increase  the  total  capacity,  but  we  de- 
crease the  accuracy  for  any  particular 
variable.  In  other  words,  we  can  make 
relatively  crude  judgments  of  several 
things  simultaneously. 

We  might  argue  that  in  the  course  of 
evolution  those  organisms  were  most 
successful  that  were  responsive  to  the 
widest  range  of  stimulus  energies  in 
their  environment.  In  order  to  survive 
in  a  constantly  fluctuating  world,  it  was 
better  to  have  a  little  information  about 
a  lot  of  things  than  to  have  a  lot  of  in- 
formation about  a  small  segment  of  the 
environment. If a compromise was necessary,
the one we seem to have made is
clearly  the  more  adaptive. 

Pollack  and  Ficks's  results  are  very 
strongly  suggestive  of  an  argument  that 
linguists  and  phoneticians  have  been 
making  for  some  time  (11).  According 
to  the  linguistic  analysis  of  the  sounds 
of  human  speech,  there  are  about  eight 
or  ten  dimensions — the  linguists  call 
them  distinctive  features — that  distin- 
guish one  phoneme  from  another.  These 
distinctive  features  are  usually  binary, 
or  at  most  ternary,  in  nature.  For  ex- 
ample, a  binary  distinction  is  made  be- 
tween vowels  and  consonants,  a  binary 
decision  is  made  between  oral  and  nasal 
consonants,  a  ternary  decision  is  made 
among  front,  middle,  and  back  pho- 
nemes, etc.  This  approach  gives  us 
quite  a  different  picture  of  speech  per- 
ception than  we  might  otherwise  obtain 
from  our  studies  of  the  speech  spectrum 
and  of  the  ear's  ability  to  discriminate 
relative  differences  among  pure  tones. 
I  am  personally  much  interested  in  this 
new  approach  (15),  and  I  regret  that 
there  is  not  time  to  discuss  it  here. 

It  was  probably  with  this  linguistic 
theory  in  mind  that  Pollack  and  Ficks 
conducted  a  test  on  a  set  of  tonal 
stimuli  that  varied  in  eight  dimensions, 
but  required  only  a  binary  decision  on 
each  dimension.  With  these  tones  they 
measured  the  transmitted  information 
at  6.9  bits,  or  about  120  recognizable 
kinds  of  sounds.  It  is  an  intriguing 
question,  as  yet  unexplored,  whether 
one  can  go  on  adding  dimensions  in- 
definitely in  this  way. 

In  human  speech  there  is  clearly  a 
limit  to  the  number  of  dimensions  that 
we  use.  In  this  instance,  however,  it  is 
not  known  whether  the  limit  is  imposed 
by  the  nature  of  the  perceptual  ma- 
chinery that  must  recognize  the  sounds 
or by the nature of the speech machinery
that must produce them. Somebody will
have to do the experiment to find out.
There is a limit, however, at
about  eight  or  nine  distinctive  features 
in  every  language  that  has  been  studied, 
and  so  when  we  talk  we  must  resort  to 
still  another  trick  for  increasing  our 
channel  capacity.  Language  uses  se- 
quences of  phonemes,  so  we  make  sev- 
eral judgments  successively  when  we 
listen  to  words  and  sentences.  That  is 
to  say,  we  use  both  simultaneous  and 
successive  discriminations  in  order  to 
expand  the  rather  rigid  limits  imposed 
by  the  inaccuracy  of  our  absolute  judg- 
ments of  simple  magnitudes. 

These  multidimensional  judgments  are 
strongly  reminiscent  of  the  abstraction 
experiment of Külpe (14). As you may
remember, Külpe showed that observers
report  more  accurately  on  an  attribute 
for  which  they  are  set  than  on  attributes 
for which they are not set. For example,
Chapman (4) used three different
attributes  and  compared  the  results  ob- 
tained when  the  observers  were  in- 
structed before  the  tachistoscopic  pres- 
entation with  the  results  obtained  when 
they  were  not  told  until  after  the  pres- 
entation which  one  of  the  three  attri- 
butes was  to  be  reported.  When  the 
instruction  was  given  in  advance,  the 
judgments  were  more  accurate.  When 
the  instruction  was  given  afterwards, 
the  subjects  presumably  had  to  judge  all 
three  attributes  in  order  to  report  on 
any  one  of  them  and  the  accuracy  was 
correspondingly  lower.  This  is  in  com- 
plete accord  with  the  results  we  have 
just  been  considering,  where  the  ac- 
curacy of  judgment  on  each  attribute 
decreased  as  more  dimensions  were 
added.  The  point  is  probably  obvious, 
but  I  shall  make  it  anyhow,  that  the 
abstraction  experiments  did  not  demon- 
strate that  people  can  judge  only  one 
attribute  at  a  time.  They  merely  showed 
what  seems  quite  reasonable,  that  peo- 
ple are  less  accurate  if  they  must  judge 
more  than  one  attribute  simultaneously. 

Subitizing

I  cannot  leave  this  general  area  with- 
out mentioning,  however  briefly,  the  ex- 
periments conducted  at  Mount  Holyoke 
College  on  the  discrimination  of  num- 
ber (12).  In  experiments  by  Kaufman, 
Lord,  Reese,  and  Volkmann  random 
patterns  of  dots  were  flashed  on  a  screen 
for 1/5 of a second. Anywhere from 1
to  more  than  200  dots  could  appear  in 
the  pattern.  The  subject's  task  was  to 
report  how  many  dots  there  were. 

The  first  point  to  note  is  that  on  pat- 
terns containing  up  to  five  or  six  dots 
the  subjects  simply  did  not  make  errors. 
The  performance  on  these  small  num- 
bers of  dots  was  so  different  from  the 
performance  with  more  dots  that  it  was 
given  a  special  name.  Below  seven  the 
subjects  were  said  to  subitize;  above 
seven  they  were  said  to  estimate.  This 
is,  as  you  will  recognize,  what  we  once 
optimistically  called  "the  span  of  atten- 
tion." 

This  discontinuity  at  seven  is,  of 
course,  suggestive.  Is  this  the  same 
basic  process  that  limits  our  unidimen- 
sional  judgments  to  about  seven  cate- 
gories? The  generalization  is  tempting, 
but  not  sound  in  my  opinion.  The  data 
on  number  estimates  have  not  been  ana- 
lyzed in  informational  terms;  but  on 
the  basis  of  the  published  data  I  would 
guess  that  the  subjects  transmitted 
something  more  than  four  bits  of  in- 
formation about  the  number  of  dots. 
Using  the  same  arguments  as  before,  we 
would  conclude  that  there  are  about  20 
or  30  distinguishable  categories  of  nu- 
merousness.  This  is  considerably  more 
information  than  we  would  expect  to 
get from a unidimensional display. It
is,  as  a  matter  of  fact,  very  much  like  a 
two-dimensional  display.  Although  the 
dimensionality  of  the  random  dot  pat- 
terns is  not  entirely  clear,  these  results 
are  in  the  same  range  as  Klemmer  and 
Frick's for their two-dimensional display
of dots in a square. Perhaps the
two dimensions of numerousness are
area  and  density.  When  the  subject 
can  subitize,  area  and  density  may  not 
be  the  significant  variables,  but  when 
the  subject  must  estimate  perhaps  they 
are  significant.  In  any  event,  the  com- 
parison is  not  so  simple  as  it  might 
seem  at  first  thought. 

This  is  one  of  the  ways  in  which  the 
magical  number  seven  has  persecuted 
me.  Here  we  have  two  closely  related 
kinds  of  experiments,  both  of  which 
point  to  the  significance  of  the  number 
seven  as  a  limit  on  our  capacities.  And 
yet  when  we  examine  the  matter  more 
closely,  there  seems  to  be  a  reasonable 
suspicion  that  it  is  nothing  more  than 
a  coincidence. 

The  Span  of  Immediate  Memory 

Let  me  summarize  the  situation  in 
this  way.  There  is  a  clear  and  definite 
limit  to  the  accuracy  with  which  we  can 
identify  absolutely  the  magnitude  of 
a  unidimensional  stimulus  variable.  I 
would  propose  to  call  this  limit  the 
span  of  absolute  judgment,  and  I 
maintain  that  for  unidimensional  judg- 
ments this  span  is  usually  somewhere 
in  the  neighborhood  of  seven.  We  are 
not  completely  at  the  mercy  of  this 
limited  span,  however,  because  we  have 
a  variety  of  techniques  for  getting 
around  it  and  increasing  the  accuracy 
of  our  judgments.  The  three  most  im- 
portant of  these  devices  are  (a)  to 
make  relative  rather  than  absolute  judg- 
ments; or,  if  that  is  not  possible,  (b) 
to  increase  the  number  of  dimensions 
along  which  the  stimuli  can  differ;  or 
(c)  to  arrange  the  task  in  such  a  way 
that  we  make  a  sequence  of  several  ab- 
solute judgments  in  a  row. 

The  study  of  relative  judgments  is 
one  of  the  oldest  topics  in  experimental 
psychology,  and  I  will  not  pause  to  re- 
view it  now.  The  second  device,  in- 
creasing the  dimensionality,  we  have  just 
considered. It seems that by adding
more dimensions and requiring crude,
binary,  yes-no  judgments  on  each  at- 
tribute we  can  extend  the  span  of  abso- 
lute judgment  from  seven  to  at  least 
150.  Judging  from  our  everyday  be- 
havior, the  limit  is  probably  in  the 
thousands,  if  indeed  there  is  a  limit.  In 
my  opinion,  we  cannot  go  on  compound- 
ing dimensions  indefinitely.  I  suspect 
that  there  is  also  a  span  of  perceptual 
dimensionality  and  that  this  span  is 
somewhere  in  the  neighborhood  of  ten, 
but  I  must  add  at  once  that  there  is  no 
objective  evidence  to  support  this  sus- 
picion. This  is  a  question  sadly  need- 
ing experimental  exploration. 

Concerning  the  third  device,  the  use 
of  successive  judgments,  I  have  quite  a 
bit  to  say  because  this  device  introduces 
memory  as  the  handmaiden  of  discrimi- 
nation. And,  since  mnemonic  processes 
are  at  least  as  complex  as  are  perceptual 
processes,  we  can  anticipate  that  their 
interactions  will  not  be  easily  disen- 
tangled. 

Suppose  that  we  start  by  simply  ex- 
tending slightly  the  experimental  pro- 
cedure that  we  have  been  using.  Up 
to  this  point  we  have  presented  a  single 
stimulus  and  asked  the  observer  to  name 
it  immediately  thereafter.  We  can  ex- 
tend this  procedure  by  requiring  the  ob- 
server to  withhold  his  response  until  we 
have  given  him  several  stimuli  in  suc- 
cession. At  the  end  of  the  sequence  of 
stimuli  he  then  makes  his  response.  We 
still  have  the  same  sort  of  input-out- 
put situation  that  is  required  for  the 
measurement  of  transmitted  informa- 
tion. But  now  we  have  passed  from 
an  experiment  on  absolute  judgment  to 
what  is  traditionally  called  an  experi- 
ment on  immediate  memory. 

Before  we  look  at  any  data  on  this 
topic  I  feel  I  must  give  you  a  word  of 
warning  to  help  you  avoid  some  obvi- 
ous associations  that  can  be  confusing. 
Everybody  knows  that  there  is  a  finite 
span of immediate memory and that for
a lot of different kinds of test materials
this  span  is  about  seven  items  in  length. 
I  have  just  shown  you  that  there  is  a 
span  of  absolute  judgment  that  can  dis- 
tinguish about  seven  categories  and  that 
there  is  a  span  of  attention  that  will 
encompass  about  six  objects  at  a  glance. 
What  is  more  natural  than  to  think  that 
all  three  of  these  spans  are  different  as- 
pects of  a  single  underlying  process? 
And  that  is  a  fundamental  mistake,  as 
I  shall  be  at  some  pains  to  demonstrate. 
This  mistake  is  one  of  the  malicious 
persecutions  that  the  magical  number 
seven  has  subjected  me  to. 

My  mistake  went  something  like  this. 
We  have  seen  that  the  invariant  fea- 
ture in  the  span  of  absolute  judgment 
is  the  amount  of  information  that  the 
observer  can  transmit.  There  is  a  real 
operational  similarity  between  the  ab- 
solute judgment  experiment  and  the 
immediate  memory  experiment.  If  im- 
mediate memory  is  like  absolute  judg- 
ment, then  it  should  follow  that  the  in- 
variant feature  in  the  span  of  immediate 
memory  is  also  the  amount  of  informa- 
tion that  an  observer  can  retain.  If  the 
amount  of  information  in  the  span  of 
immediate  memory  is  a  constant,  then 
the  span  should  be  short  when  the  indi- 
vidual items  contain  a  lot  of  informa- 
tion and  the  span  should  be  long  when 
the  items  contain  little  information.  For 
example, decimal digits are worth 3.3
bits  apiece.  We  can  recall  about  seven 
of  them,  for  a  total  of  23  bits  of  in- 
formation. Isolated  English  words  are 
worth  about  10  bits  apiece.  If  the  total 
amount  of  information  is  to  remain 
constant at 23 bits, then we should be
able  to  remember  only  two  or  three 
words  chosen  at  random.  In  this  way 
I  generated  a  theory  about  how  the  span 
of  immediate  memory  should  vary  as  a 
function  of  the  amount  of  information 
per  item  in  the  test  materials. 
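
The prediction of this constant-information hypothesis can be worked out directly. The sketch below assumes that every item is drawn with equal probability from a fixed set of alternatives, and it assumes a vocabulary of 1,000 words to match the figure of about 10 bits per word; the names are hypothetical.

```python
import math

# Seven decimal digits at log2(10) bits each: about 23 bits in all.
TOTAL_BITS = 7 * math.log2(10)

def predicted_span(alternatives):
    """Span, in items, predicted if immediate memory held a fixed number of bits."""
    return TOTAL_BITS / math.log2(alternatives)

for name, n in [("binary digits", 2),
                ("decimal digits", 10),
                ("words from a 1,000-word vocabulary", 1000)]:
    print(f"{name}: {predicted_span(n):.1f} items")
```

Under these assumptions the hypothesis predicts a span of about 23 binary digits, 7 decimal digits, and between two and three words.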

The  measurements  of  memory  span  in 
the literature are suggestive on this
question, but not definitive. And so it
was  necessary  to  do  the  experiment  to 
see.  Hayes  (10)  tried  it  out  with  five 
different  kinds  of  test  materials:  binary 
digits,  decimal  digits,  letters  of  the  al- 
phabet, letters  plus  decimal  digits,  and 
with  1,000  monosyllabic  words.  The 
lists  were  read  aloud  at  the  rate  of  one 
item  per  second  and  the  subjects  had  as 
much  time  as  they  needed  to  give  their 
responses.  A  procedure  described  by 
Woodworth  (20)  was  used  to  score  the 
responses. 

The  results  are  shown  by  the  filled 
circles  in  Fig.  7.  Here  the  dotted  line 
indicates  what  the  span  should  have 
been  if  the  amount  of  information  in  the 
span  were  constant.  The  solid  curves 
represent  the  data.  Hayes  repeated  the 
experiment  using  test  vocabularies  of 
different  sizes  but  all  containing  only 
English  monosyllables  (open  circles  in 
Fig.  7).  This  more  homogeneous  test 
material  did  not  change  the  picture  sig- 
nificantly. With  binary  items  the  span 
is  about  nine  and,  although  it  drops  to 
about  five  with  monosyllabic  English 
words,  the  difference  is  far  less  than 
the  hypothesis  of  constant  information 
would  require. 

Fig. 7. Data from Hayes (10) on the span
of immediate memory plotted as a function
of the amount of information per item in the
test materials.

Fig. 8. Data from Pollack (16) on the
amount of information retained after one
presentation plotted as a function of the
amount of information per item in the test
materials.

There  is  nothing  wrong  with  Hayes's 
experiment,  because  Pollack  (16)  re- 
peated it  much  more  elaborately  and 
got  essentially  the  same  result.  Pol- 
lack took  pains  to  measure  the  amount 
of  information  transmitted  and  did  not 
rely  on  the  traditional  procedure  for 
scoring  the  responses.  His  results  are 
plotted  in  Fig.  8.  Here  it  is  clear  that 
the  amount  of  information  transmitted 
is  not  a  constant,  but  increases  almost 
linearly  as  the  amount  of  information 
per  item  in  the  input  is  increased. 

And  so  the  outcome  is  perfectly  clear. 
In  spite  of  the  coincidence  that  the 
magical  number  seven  appears  in  both 
places,  the  span  of  absolute  judgment 
and  the  span  of  immediate  memory  are 
quite  different  kinds  of  limitations  that 
are  imposed  on  our  ability  to  process 
information.  Absolute  judgment  is  lim- 
ited by  the  amount  of  information.  Im- 
mediate memory  is  limited  by  the  num- 
ber of  items.  In  order  to  capture  this  dis- 
tinction in  somewhat  picturesque  terms, 
I have fallen into the custom of distinguishing
between bits of information
and  chunks  of  information.  Then  I  can 
say  that  the  number  of  bits  of  informa- 
tion is  constant  for  absolute  judgment 
and the number of chunks of information
is constant for immediate memory.
The  span  of  immediate  memory  seems 
to  be  almost  independent  of  the  number 
of  bits  per  chunk,  at  least  over  the 
range  that  has  been  examined  to  date. 
The  contrast  of  the  terms  bit  and 
chunk  also  serves  to  highlight  the  fact 
that  we  are  not  very  definite  about  what 
constitutes  a  chunk  of  information.  For 
example,  the  memory  span  of  five  words 
that  Hayes  obtained  when  each  word 
was  drawn  at  random  from  a  set  of  1000 
English  monosyllables  might  just  as  ap- 
propriately have  been  called  a  memory 
span of 15 phonemes, since each word
had  about  three  phonemes  in  it.  Intui- 
tively, it  is  clear  that  the  subjects  were 
recalling  five  words,  not  15  phonemes, 
but  the  logical  distinction  is  not  im- 
mediately apparent.  We  are  dealing 
here  with  a  process  of  organizing  or 
grouping  the  input  into  familiar  units 
or  chunks,  and  a  great  deal  of  learning 
has  gone  into  the  formation  of  these 
familiar  units. 

Recoding 

In  order  to  speak  more  precisely, 
therefore,  we  must  recognize  the  impor- 
tance of  grouping  or  organizing  the  in- 
put sequence  into  units  or  chunks. 
Since  the  memory  span  is  a  fixed  num- 
ber of  chunks,  we  can  increase  the  num- 
ber of  bits  of  information  that  it  con- 
tains simply  by  building  larger  and 
larger  chunks,  each  chunk  containing 
more  information  than  before. 

A  man  just  beginning  to  learn  radio- 
telegraphic  code  hears  each  dit  and  dah 
as  a  separate  chunk.  Soon  he  is  able 
to  organize  these  sounds  into  letters  and 
then  he  can  deal  with  the  letters  as 
chunks.  Then  the  letters  organize 
themselves  as  words,  which  are  still 
larger  chunks,  and  he  begins  to  hear 
whole  phrases.  I  do  not  mean  that  each 
step  is  a  discrete  process,  or  that  pla- 
teaus must  appear  in  his  learning  curve, 
for surely the levels of organization are
achieved at different rates and overlap
each  other  during  the  learning  process. 
I  am  simply  pointing  to  the  obvious 
fact  that  the  dits  and  dahs  are  organ- 
ized by  learning  into  patterns  and  that 
as  these  larger  chunks  emerge  the 
amount  of  message  that  the  operator 
can  remember  increases  correspondingly. 
In  the  terms  I  am  proposing  to  use,  the 
operator  learns  to  increase  the  bits  per 
chunk. 

In  the  jargon  of  communication  the- 
ory, this  process  would  be  called  recod- 
ing. The  input  is  given  in  a  code  that 
contains  many  chunks  with  few  bits  per 
chunk.  The  operator  recodes  the  input 
into  another  code  that  contains  fewer 
chunks  with  more  bits  per  chunk.  There 
are  many  ways  to  do  this  recoding,  but 
probably  the  simplest  is  to  group  the 
input  events,  apply  a  new  name  to  the 
group,  and  then  remember  the  new  name 
rather  than  the  original  input  events. 

Since  I  am  convinced  that  this  proc- 
ess is  a  very  general  and  important  one 
for  psychology,  I  want  to  tell  you  about 
a  demonstration  experiment  that  should 
make  perfectly  explicit  what  I  am  talk- 
ing about.  This  experiment  was  con- 
ducted by  Sidney  Smith  and  was  re- 
ported by  him  before  the  Eastern  Psy- 
chological Association  in  1954. 

Begin  with  the  observed  fact  that  peo- 
ple can  repeat  back  eight  decimal  digits, 
but  only  nine  binary  digits.  Since  there 
is  a  large  discrepancy  in  the  amount  of 
information  recalled  in  these  two  cases, 
we  suspect  at  once  that  a  recoding  pro- 
cedure could  be  used  to  increase  the 
span  of  immediate  memory  for  binary 
digits.  In  Table  1  a  method  for  group- 
ing and  renaming  is  illustrated.  Along 
the  top  is  a  sequence  of  18  binary  digits, 
far  more  than  any  subject  was  able  to 
recall  after  a  single  presentation.  In 
the  next  line  these  same  binary  digits 
are  grouped  by  pairs.  Four  possible 
pairs  can  occur:  00  is  renamed  0,  01  is 
renamed  1,  10  is  renamed  2,  and  11  is 
renamed 3.

TABLE 1
Ways of Recoding Sequences of Binary Digits

Binary Digits (Bits)   1 0 1 0 0 0 1 0 0 1 1 1 0 0 1 1 1 0

2:1   Chunks           10  10  00  10  01  11  00  11  10
      Recoding         2   2   0   2   1   3   0   3   2

3:1   Chunks           101   000   100   111   001   110
      Recoding         5     0     4     7     1     6

4:1   Chunks           1010   0010   0111   0011   10
      Recoding         10     2      7      3

5:1   Chunks           10100   01001   11001   110
      Recoding         20      9       25

That is to say, we recode
from a base-two arithmetic to a base-four
arithmetic. In the recoded sequence
there are now just nine digits to
remember,  and  this  is  almost  within  the 
span  of  immediate  memory.  In  the  next 
line  the  same  sequence  of  binary  digits 
is  regrouped  into  chunks  of  three.  There 
are  eight  possible  sequences  of  three,  so 
we  give  each  sequence  a  new  name  be- 
tween 0  and  7.  Now  we  have  recoded 
from  a  sequence  of  18  binary  digits 
into  a  sequence  of  6  octal  digits,  and 
this  is  well  within  the  span  of  immedi- 
ate memory.  In  the  last  two  lines  the 
binary  digits  are  grouped  by  fours  and 
by  fives  and  are  given  decimal-digit 
names  from  0  to  15  and  from  0  to  31. 
It  is  reasonably  obvious  that  this  kind 
of  recoding  increases  the  bits  per  chunk, 
and  packages  the  binary  sequence  into 
a  form  that  can  be  retained  within  the 
span  of  immediate  memory.  So  Smith 
assembled  20  subjects  and  measured 
their  spans  for  binary  and  octal  digits. 
The  spans  were  9  for  binaries  and  7  for 
octals.  Then  he  gave  each  recoding 
scheme  to  five  of  the  subjects.  They 
studied  the  recoding  until  they  said 
they  understood  it — for  about  5  or  10 
minutes.  Then  he  tested  their  span  for 
binary  digits  again  while  they  tried  to 
use  the  recoding  schemes  they  had 
studied. 
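
The grouping-and-renaming procedure of Table 1 can be sketched in a few lines. This is only an illustration of the recoding rule, not a description of how Smith ran the experiment; the function name is hypothetical, and trailing digits that do not fill a chunk are simply dropped, as they are left unnamed in the table.

```python
def recode(bits, ratio):
    """Group `bits` into chunks of `ratio` digits and name each full chunk
    by the number it represents in base two."""
    full = len(bits) - len(bits) % ratio   # ignore an incomplete final chunk
    return [int(bits[i:i + ratio], 2) for i in range(0, full, ratio)]

sequence = "101000100111001110"   # the 18 binary digits of Table 1
for ratio in (2, 3, 4, 5):
    print(f"{ratio}:1", recode(sequence, ratio))
# 2:1 [2, 2, 0, 2, 1, 3, 0, 3, 2]
# 3:1 [5, 0, 4, 7, 1, 6]
# 4:1 [10, 2, 7, 3]
# 5:1 [20, 9, 25]
```

The number of names to be remembered falls from 18 to 9, 6, 4, and 3 as the recoding ratio rises, while the number of alternatives per name grows from 2 to 4, 8, 16, and 32.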


The  recoding  schemes  increased  their 
span  for  binary  digits  in  every  case. 
But  the  increase  was  not  as  large  as  we 
had  expected  on  the  basis  of  their  span 
for  octal  digits.  Since  the  discrepancy 
increased  as  the  recoding  ratio  increased, 
we  reasoned  that  the  few  minutes  the 
subjects  had  spent  learning  the  recod- 
ing schemes  had  not  been  sufficient. 
Apparently  the  translation  from  one 
code  to  the  other  must  be  almost  auto- 
matic or  the  subject  will  lose  part  of  the 
next  group  while  he  is  trying  to  remem- 
ber the  translation  of  the  last  group. 

Since  the  4 : 1  and  5 : 1  ratios  require 
considerable study, Smith decided to
imitate  Ebbinghaus  and  do  the  experi- 
ment on  himself.  With  Germanic  pa- 
tience he  drilled  himself  on  each  recod- 
ing successively,  and  obtained  the  re- 
sults shown  in  Fig.  9.  Here  the  data 
follow  along  rather  nicely  with  the  re- 
sults you  would  predict  on  the  basis  of 
his  span  for  octal  digits.  He  could  re- 
member 12  octal  digits.  With  the  2:1 
recoding,  these  12  chunks  were  worth 
24  binary  digits.  With  the  3:1  recod- 
ing they  were  worth  36  binary  digits. 
With  the  4 : 1  and  5 : 1  recodings,  they 
were  worth  about  40  binary  digits. 
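
The predicted values can be reproduced by multiplying the span in chunks by the number of binary digits each chunk stands for. The sketch below uses the factors 2, 3, and 3.3 given in the caption of Fig. 9; reading 3.3 as approximately the base-two logarithm of 10 is an interpretation, and the function name is hypothetical.

```python
def predicted_binary_span(chunk_span, bits_per_chunk):
    """Binary digits recalled if a fixed number of chunks is retained."""
    return chunk_span * bits_per_chunk

OCTAL_SPAN = 12   # octal digits Smith could repeat back, per the text

# Factors taken from the Fig. 9 caption (recoding into base 4, 8, and 10).
for scheme, factor in [("2:1", 2), ("3:1", 3), ("4:1 and 5:1", 3.3)]:
    print(scheme, round(predicted_binary_span(OCTAL_SPAN, factor)))
# 2:1 24
# 3:1 36
# 4:1 and 5:1 40
```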

Fig. 9. The span of immediate memory for
binary digits is plotted as a function of the
recoding procedure used. The predicted function
is obtained by multiplying the span for
octals by 2, 3 and 3.3 for recoding into base
4, base 8, and base 10, respectively.

It is a little dramatic to watch a person
get 40 binary digits in a row and
then repeat them back without error.
However, if you think of this merely as
a mnemonic trick for extending the
memory  span,  you  will  miss  the  more 
important  point  that  is  implicit  in 
nearly  all  such  mnemonic  devices.  The 
point  is  that  recoding  is  an  extremely 
powerful  weapon  for  increasing  the 
amount  of  information  that  we  can 
deal  with.  In  one  form  or  another  we 
use  recoding  constantly  in  our  daily 
behavior. 

In  my  opinion  the  most  customary 
kind  of  recoding  that  we  do  all  the  time 
is  to  translate  into  a  verbal  code.  When 
there  is  a  story  or  an  argument  or  an 
idea  that  we  want  to  remember,  we  usu- 
ally try  to  rephrase  it  "in  our  own 
words."  When  we  witness  some  event 
we  want  to  remember,  we  make  a  verbal 
description  of  the  event  and  then  re- 
member our  verbalization.  Upon  recall 
we  recreate  by  secondary  elaboration 
the  details  that  seem  consistent  with 
the  particular  verbal  recoding  we  hap- 
pen to  have  made.  The  well-known  ex- 
periment by  Carmichael,  Hogan,  and 
Walter  (3)  on  the  influence  that  names 
have  on  the  recall  of  visual  figures  is 
one  demonstration  of  the  process. 

The inaccuracy of the testimony of
eyewitnesses is well known in legal psychology,
but the distortions of testimony
are not random; they follow
naturally  from  the  particular  recoding 
that  the  witness  used,  and  the  particu- 
lar recoding  he  used  depends  upon  his 
whole  life  history.  Our  language  is  tre- 
mendously useful  for  repackaging  ma- 
terial into  a  few  chunks  rich  in  infor- 
mation. I  suspect  that  imagery  is  a 
form  of  recoding,  too,  but  images  seem 
much  harder  to  get  at  operationally  and 
to  study  experimentally  than  the  more 
symbolic  kinds  of  recoding. 

It  seems  probable  that  even  memori- 
zation can  be  studied  in  these  terms. 
The  process  of  memorizing  may  be  sim- 
ply the  formation  of  chunks,  or  groups 
of  items  that  go  together,  until  there 
are  few  enough  chunks  so  that  we  can 
recall all the items. The work by Bousfield
and Cohen (2) on the occurrence
of  clustering  in  the  recall  of  words  is 
especially  interesting  in  this  respect. 

Summary 

I  have  come  to  the  end  of  the  data 
that  I  wanted  to  present,  so  I  would 
like  now  to  make  some  summarizing  re- 
marks. 

First,  the  span  of  absolute  judgment 
and  the  span  of  immediate  memory  im- 
pose severe  limitations  on  the  amount 
of  information  that  we  are  able  to  re- 
ceive, process,  and  remember.  By  or- 
ganizing the  stimulus  input  simultane- 
ously into  several  dimensions  and  suc- 
cessively into  a  sequence  of  chunks,  we 
manage  to  break  (or  at  least  stretch) 
this  informational  bottleneck. 

Second,  the  process  of  recoding  is  a 
very  important  one  in  human  psychol- 
ogy and  deserves  much  more  explicit  at- 
tention than  it  has  received.  In  par- 
ticular, the  kind  of  linguistic  recoding 
that  people  do  seems  to  me  to  be  the 
very  lifeblood  of  the  thought  processes. 
Recoding  procedures  are  a  constant 
concern to clinicians, social psychologists,
linguists, and anthropologists and
yet,  probably  because  recoding  is  less 
accessible  to  experimental  manipulation 
than  nonsense  syllables  or  T  mazes,  the 
traditional  experimental  psychologist  has 
contributed  little  or  nothing  to  their 
analysis.  Nevertheless,  experimental 
techniques  can  be  used,  methods  of  re- 
coding  can  be  specified,  behavioral  in- 
dicants can  be  found.  And  I  anticipate 
that  we  will  find  a  very  orderly  set  of 
relations  describing  what  now  seems  an 
uncharted  wilderness  of  individual  dif- 
ferences. 

Third,  the  concepts  and  measures 
provided  by  the  theory  of  information 
provide  a  quantitative  way  of  getting  at 
some  of  these  questions.  The  theory 
provides  us  with  a  yardstick  for  cali- 
brating our  stimulus  materials  and  for 
measuring  the  performance  of  our  sub- 
jects. In  the  interests  of  communica- 
tion I  have  suppressed  the  technical  de- 
tails of  information  measurement  and 
have  tried  to  express  the  ideas  in  more 
familiar  terms;  I  hope  this  paraphrase 
will  not  lead  you  to  think  they  are  not 
useful  in  research.  Informational  con- 
cepts have  already  proved  valuable  in 
the  study  of  discrimination  and  of  lan- 
guage; they  promise  a  great  deal  in  the 
study  of  learning  and  memory;  and  it 
has  even  been  proposed  that  they  can 
be  useful  in  the  study  of  concept  for- 
mation. A  lot  of  questions  that  seemed 
fruitless  twenty  or  thirty  years  ago  may 
now  be  worth  another  look.  In  fact,  I 
feel  that  my  story  here  must  stop  just 
as  it  begins  to  get  really  interesting. 

And  finally,  what  about  the  magical 
number  seven?  What  about  the  seven 
wonders  of  the  world,  the  seven  seas, 
the  seven  deadly  sins,  the  seven  daugh- 
ters of  Atlas  in  the  Pleiades,  the  seven 
ages  of  man,  the  seven  levels  of  hell, 
the  seven  primary  colors,  the  seven  notes 
of  the  musical  scale,  and  the  seven  days 
of the week? What about the seven-point
rating scale, the seven categories
for absolute judgment, the seven objects
in the span of attention, and the
seven  digits  in  the  span  of  immediate 
memory?  For  the  present  I  propose  to 
withhold  judgment.  Perhaps  there  is 
something  deep  and  profound  behind  all 
these  sevens,  something  just  calling  out 
for  us  to  discover  it.  But  I  suspect 
that  it  is  only  a  pernicious,  Pythagorean 
coincidence. 

REMARKS ON THE METHOD OF PAIRED COMPARISONS:
I. THE LEAST SQUARES SOLUTION ASSUMING
EQUAL STANDARD DEVIATIONS
AND EQUAL CORRELATIONS*

Frederick  Mosteller 

HARVARD  UNIVERSITY 

Thurstone's  Case  V  of  the  method  of  paired  comparisons  as- 
sumes equal  standard  deviations  of  sensations  corresponding  to 
stimuli  and  zero  correlations  between  pairs  of  stimuli  sensations. 
It  is  shown  that  the  assumption  of  zero  correlations  can  be  relaxed 
to  an  assumption  of  equal  correlations  between  pairs  with  no  change 
in  method.  Further  the  usual  approach  to  the  method  of  paired  com- 
parisons Case  V  is  shown  to  lead  to  a  least  squares  estimate  of  the 
stimulus  positions  on  the  sensation  scale. 

1. Introduction. The fundamental notions underlying Thurstone's
method of paired comparisons (4) are these:

(1) There is a set of stimuli which can be located on a subjective
continuum (a sensation scale, usually not having a measurable
physical characteristic).

(2)  Each  stimulus  when  presented  to  an  individual  gives  rise 
to  a  sensation  in  the  individual. 

(3)  The  distribution  of  sensations  from  a  particular  stimulus 
for  a  population  of  individuals  is  normal. 

(4)  Stimuli  are  presented  in  pairs  to  an  individual,  thus  giv- 
ing rise  to  a  sensation  for  each  stimulus.  The  individual  com- 
pares these  sensations  and  reports  which  is  greater. 

(5)  It  is  possible  for  these  paired  sensations  to  be  correlated. 

(6)  Our  task  is  to  space  the  stimuli  (the  sensation  means),  ex- 
cept for  a  linear  transformation. 

*This  research  was  performed  in  the  Laboratory  of  Social  Relations  under 
a  grant  made  available  to  Harvard  University  by  the  RAND  Corporation  under 
the  Department  of  the  Air  Force,  Project  RAND. 

This  article  appeared  in  Psychometrika,  1951,  16,  3-9.    Reprinted  with  permission. 

There are numerous variations of the basic materials used in
the analysis: for example, we may not have n different individuals,
but only one individual who makes all comparisons several times; or
several individuals may make all comparisons several times; the
individuals need not be people.

Furthermore, there are "cases" to be discussed: for example,
shall we assume all the intercorrelations equal, or shall we assume
them zero? Shall we assume the standard deviations of the sensation
distributions equal or not?

The  case  which  has  been  discussed  most  fully  is  known  as  Thur- 
stone's  Case  V.  Thurstone  has  assumed  in  this  case  that  the  stand- 
ard deviations  of  the  sensation  distributions  are  equal  and  that  the 
correlations  between  pairs  of  stimulus  sensations  are  zero.  We  shall 
discuss  a  standard  method  of  ordering  the  stimuli  for  this  Case  V. 
Case  V  has  been  employed  quite  frequently  and  seems  to  fit  empirical 
data  rather  well  in  the  sense  of  reproducing  the  original  proportions 
of  the  paired  comparison  table.  The  assumption  of  equal  standard 
deviations  is  a  reasonable  first  approximation.  We  will  not  stick  to 
the  assumption  of  zero  correlations,  because  this  does  not  seem  to  be 
essential  for  Case  V. 

2. Ordering Stimuli with Error-Free Data. We assume there
are a number of objects or stimuli, $O_1, O_2, \ldots, O_n$. These stimuli
give rise to sensations which lie on a single sensation continuum $S$.
If $X_i$ and $X_j$ are single sensations evoked in an individual $I$ by the
$i$th and $j$th stimuli, then we assume $X_i$ and $X_j$ to be jointly normally
distributed for the population of individuals with

$$
\begin{aligned}
\text{mean of } X_i &= S_i && (i = 1, 2, \ldots, n) \\
\text{variance of } X_i &= \sigma^2(X_i) = \sigma^2 && (i = 1, 2, \ldots, n) \\
\text{correlation of } X_i \text{ and } X_j &= \rho_{ij} = \rho && (i, j = 1, 2, \ldots, n).
\end{aligned}
\tag{1}
$$

The  marginal  distributions  of  the  Xi's  appear  as  in  Figure  1. 


Figure  1 
The   Marginal   Distributions   of  the   Sensations   Produced   by  the   Separate 
Stimuli  in  Thurstone's  Case  V  of  the  Method  of  Paired  Comparisons. 

The figure indicates the possibility that $X_2 < X_1$, even though $S_1 < S_2$.
In fact this has to happen part of the time if we are to build anything
more than a rank-order scale.

An individual $I$ compares $O_i$ and $O_j$ and reports whether
$X_i > X_j$ or $X_i < X_j$ (no ties are allowed).

We can best see the tenor of the method for ordering the stimuli
if we first work through the problem in the case of nonfallible data.
For the case of nonfallible data we assume we know the true proportion
of the time $X_i$ exceeds $X_j$, and that the conditions given above
(1) are exactly fulfilled.

Our problem is to find the spacing of the stimuli (or the spacing
of the mean sensations produced by them, the $S_1, \ldots, S_n$ points in
Figure 1). Clearly we cannot hope to do this except within a linear
transformation, for the data reported are merely the percentages of
times $X_i$ exceeds $X_j$, say $p_{ij}$.

$$
p_{ij} = P(X_i > X_j) = \frac{1}{\sqrt{2\pi}\,\sigma(d_{ij})} \int_0^{\infty}
\exp\!\left\{ -\frac{[d_{ij} - (S_i - S_j)]^2}{2\sigma^2(d_{ij})} \right\} \, d(d_{ij}) \tag{2}
$$

where $d_{ij} = X_i - X_j$, and $\sigma^2(d_{ij}) = 2\sigma^2(1 - \rho)$. There will be no
loss in generality in assigning the scale factor so that

$$
2\sigma^2(1 - \rho) = 1. \tag{3}
$$

It  is  at  this  point  that  we  depart  slightly  from  Thurstone,  who  char- 
acterized Case  V  as  having  equal  variances  and  zero  correlations. 
However,  his  derivations  only  assume  the  correlations  are  zero  ex- 
plicitly (and  artificially),  but  are  carried  through  implicitly  with 
equal  correlations  (not  necessarily  zero).  Actually  this  is  a  great 
easing  of  conditions.  We  can  readily  imagine  a  set  of  attitudinal 
items on the same continuum correlated .34, .38, .42, i.e., nearly
equal.  But  it  is  difficult  to  imagine  them  all  correlated  zero  with  one 
another.  Past  uses  of  this  method  have  all  benefited  from  the  fact 
that  items  were  not  really  assumed  to  be  uncorrelated.  It  was  only 
stated  that  the  model  assumed  the  items  were  uncorrelated,  but  the 
model  was  unable  to  take  cognizance  of  the  statement.  Guttman  (2) 
has  noticed  this  independently. 
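
The role of the equal correlation can be made explicit with a short worked step. It uses only the assumptions in (1); the symbol $\Phi$ for the unit normal distribution function is introduced here for convenience and does not appear in the original.

```latex
\begin{aligned}
\sigma^2(d_{ij}) &= \sigma^2(X_i) + \sigma^2(X_j) - 2\rho\,\sigma(X_i)\,\sigma(X_j)
                  = 2\sigma^2(1 - \rho), \\
p_{ij} &= P(d_{ij} > 0)
        = \Phi\!\left( \frac{S_i - S_j}{\sigma\sqrt{2(1 - \rho)}} \right).
\end{aligned}
```

A common $\rho$ therefore enters only through the unit in which $S_i - S_j$ is measured, which is why fixing the scale factor as in (3) removes it from the computation.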

With the scale factor chosen in equation (3), we can rewrite
equation (2)

$$
p_{ij} = \frac{1}{\sqrt{2\pi}} \int_{-(S_i - S_j)}^{\infty} e^{-\frac{1}{2} y^2} \, dy. \tag{4}
$$


From (4), given any $p_{ij}$ we can solve for $-(S_i - S_j)$ by use of a
normal table of areas. Then if we arbitrarily assign as a location
parameter $S_1 = 0$, we can compute all other $S_i$. Thus given the $p_{ij}$
matrix we can find the $S_i$. The problem with fallible data is more
complicated.
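
The error-free procedure can be sketched in a few lines of Python. This is a minimal illustration, not the author's computation: the inverse normal function from the standard library stands in for the "normal table of areas", and the scale values in `true_s` are hypothetical numbers used only to manufacture exact proportions.

```python
from statistics import NormalDist

def scale_from_exact_proportions(p):
    """Recover S_i (with S_1 = 0) from error-free p[i][j] = P(X_i > X_j)."""
    std_normal = NormalDist()  # unit normal, matching the scale factor in (3)
    n = len(p)
    # Equation (4) says p_ij is the unit normal area above -(S_i - S_j),
    # so S_i - S_1 is the normal deviate corresponding to p_i1.
    return [0.0] + [std_normal.inv_cdf(p[i][0]) for i in range(1, n)]

# Hypothetical true scale values (assumption, for illustration only).
true_s = [0.0, 0.5, 1.2]
phi = NormalDist().cdf
p_exact = [[phi(si - sj) for sj in true_s] for si in true_s]

recovered = scale_from_exact_proportions(p_exact)
print([round(s, 6) for s in recovered])
```

Only the first column of the proportion matrix is needed here because, with exact data, every column gives the same answer.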

3. Paired Comparison Scaling with Fallible Data. When we
have fallible data, we have $p'_{ij}$ which are estimates of the true $p_{ij}$.
Analogous to equation (4) we have

$$
p'_{ij} = \frac{1}{\sqrt{2\pi}} \int_{-D'_{ij}}^{\infty} e^{-\frac{1}{2} y^2} \, dy, \tag{5}
$$

where the $D'_{ij}$ are estimates of $D_{ij} = S_i - S_j$. We merely look up the
normal deviate corresponding to $p'_{ij}$ to get the matrix of $D'_{ij}$. We
notice further that the $D'_{ij}$ need not be consistent in the sense that
the $D_{ij}$ were; i.e.,

$$
D_{ij} + D_{jk} = S_i - S_j + S_j - S_k = D_{ik}
$$

does not hold for the $D'_{ij}$.

We conceive the problem as follows: from the $D'_{ij}$ to construct
a set of estimates of the $S_i$'s called $S'_i$, such that

$$
\Sigma = \sum_{i,j} [D'_{ij} - (S'_i - S'_j)]^2 \quad \text{is to be a minimum.} \tag{6}
$$

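It will help to indicate another form of solution for nonfallible
data. One can set up the $S_i - S_j$ matrix:

MATRIX OF $S_i - S_j$

$$
\begin{array}{c|ccccc}
 & 1 & 2 & 3 & \cdots & n \\
\hline
1 & S_1 - S_1 & S_1 - S_2 & S_1 - S_3 & \cdots & S_1 - S_n \\
2 & S_2 - S_1 & S_2 - S_2 & S_2 - S_3 & \cdots & S_2 - S_n \\
3 & S_3 - S_1 & S_3 - S_2 & S_3 - S_3 & \cdots & S_3 - S_n \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
n & S_n - S_1 & S_n - S_2 & S_n - S_3 & \cdots & S_n - S_n \\
\hline
\text{Totals} & \sum S_i - nS_1 & \sum S_i - nS_2 & \sum S_i - nS_3 & \cdots & \sum S_i - nS_n \\
\text{Means} & \bar{S} - S_1 & \bar{S} - S_2 & \bar{S} - S_3 & \cdots & \bar{S} - S_n
\end{array}
$$

Now by setting $S_1 = 0$, we get $S_2 = (\bar{S} - S_1) - (\bar{S} - S_2)$,
$S_3 = (\bar{S} - S_1) - (\bar{S} - S_3)$, and so on. We will use this plan shortly for
the $S'_i$.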

If we wish to minimize expression (6) we take the partial derivative
with respect to $S'_i$. Since $D'_{ij} = -D'_{ji}$ and $S'_i - S'_j = -(S'_j - S'_i)$
and $D'_{ii} = S'_i - S'_i = 0$, we need only concern ourselves
with the sum of squares from above the main diagonal in the
$D'_{ij} - (S'_i - S'_j)$ matrix, i.e., terms for which $i < j$. Differentiating
with respect to $S'_i$ we get:

$$
\frac{\partial(\Sigma/2)}{\partial S'_i}
= 2\left[ \sum_{j=1}^{i-1} (D'_{ji} - S'_j + S'_i) - \sum_{j=i+1}^{n} (D'_{ij} - S'_i + S'_j) \right]
\qquad (i = 1, 2, \ldots, n). \tag{7}
$$

Setting this partial derivative equal to zero we have

$$
S'_1 + S'_2 + \cdots + S'_{i-1} - (n - 1)S'_i + S'_{i+1} + \cdots + S'_n
= \sum_{j=1}^{i-1} D'_{ji} - \sum_{j=i+1}^{n} D'_{ij}
\qquad (i = 1, 2, \ldots, n), \tag{8}
$$

but $D'_{ij} = -D'_{ji}$, and $D'_{ii} = 0$; this makes the right side of (8)

$$
\sum_{j=1}^{i-1} D'_{ji} + D'_{ii} + \sum_{j=i+1}^{n} D'_{ji} = \sum_{j=1}^{n} D'_{ji}.
$$

Thus (8) can be written

$$
\sum_{j=1}^{n} S'_j - nS'_i = \sum_{j=1}^{n} D'_{ji}
\qquad (i = 1, 2, \ldots, n). \tag{9}
$$

The determinant of the coefficients of the left side of (9) vanishes.
This is to be expected because we have only chosen our scale
and have not assigned a location parameter. There are various ways
to assign this location parameter, for example, by setting $\bar{S}' = 0$ or
by setting $S'_1 = 0$. We choose to set $S'_1 = 0$. This means we will
measure distances from $S'_1$. Then we try the solution (10) which is
suggested by the similarity of the left side of (9) to the total column
in the matrix of $S_i - S_j$.

$$
S'_i = \sum_{j=1}^{n} D'_{j1}/n - \sum_{j=1}^{n} D'_{ji}/n. \tag{10}
$$

Notice that when $i = 1$, $S'_1 = 0$ and that

$$
\sum_{i=1}^{n} S'_i = \sum_{j=1}^{n} D'_{j1},
$$

because

$$
\sum_{i} \sum_{j} D'_{ji} = 0,
$$

which happens because every term and its negative appear in this
double sum. Therefore, substituting (10) in the left side of (9) we
have

$$
\sum_{j=1}^{n} D'_{j1} - n\left[ \sum_{j=1}^{n} D'_{j1}/n - \sum_{j=1}^{n} D'_{ji}/n \right]
= \sum_{j=1}^{n} D'_{ji}, \tag{11}
$$

which is an identity, and the equations are solved. Of course, any
linear transformation of the solutions is equally satisfactory.
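
The least squares solution can be sketched as follows. This is an illustrative reading of equations (5) and (10), not code from the paper: the function names are invented here, the inverse normal function replaces the table lookup, and the proportions in `p_obs` are hypothetical.

```python
from statistics import NormalDist

def normal_deviates(p_obs):
    """D'_ij: unit normal deviate for each observed p'_ij, as in equation (5).

    p_obs[i][j] is the observed proportion of times stimulus i was judged
    greater than stimulus j. Diagonal entries are ignored and set to zero.
    """
    inv = NormalDist().inv_cdf
    n = len(p_obs)
    return [[0.0 if i == j else inv(p_obs[i][j]) for j in range(n)]
            for i in range(n)]

def case_v_least_squares(d):
    """Equation (10): S'_i = (mean of column 1) - (mean of column i)."""
    n = len(d)
    col_mean = [sum(d[j][i] for j in range(n)) / n for i in range(n)]
    return [col_mean[0] - col_mean[i] for i in range(n)]

# Hypothetical observed proportions with p_ij + p_ji = 1 (assumption).
p_obs = [
    [0.50, 0.31, 0.12],
    [0.69, 0.50, 0.27],
    [0.88, 0.73, 0.50],
]
d_obs = normal_deviates(p_obs)
s_est = case_v_least_squares(d_obs)
print([round(s, 3) for s in s_est])
```

The first estimate is zero by construction, which corresponds to the choice $S'_1 = 0$ of location parameter.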

The  point  of  this  presentation  is  to  provide  a  background  for 
the  theory  of  paired  comparisons,  to  indicate  that  the  assumption  of 
zero  correlations  is  unnecessary,  and  to  show  that  the  customary 
solution  to  paired  comparisons  is  a  least  squares  solution  in  the 
sense  of  condition  (6).  That  this  is  a  least  squares  solution  seems 
not  to  be  mentioned  in  the  literature  although  it  may  have  been 
known  to  Horst  (3),  since  he  worked  closely  along  these  lines. 

This least squares solution is not entirely satisfactory because
the $p'_{ij}$ tend to zero and unity when extreme stimuli are compared.
This introduces unsatisfactorily large numbers in the $D'_{ij}$ table. This
difficulty is usually met by excluding all numbers beyond, say, 2.0
from the table. After a preliminary arrangement of columns so that
the $S'_i$ will be in approximately proper order, the quantity

$$
\sum_{i} (D'_{ij} - D'_{i,j+1})/k
$$

is computed where the summation is over the $k$ values of $i$ for which
entries appear in both column $j$ and $j+1$. Then differences between
such means are taken as the scale separations (see for example Guilford's
discussion (1) of the method of paired comparisons). This
method seems to give reasonable results. The computations for methods
which take account of the differing variabilities of the $p'_{ij}$ and
therefore of the $D'_{ij}$ seem to be unmercifully extensive.
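
A rough sketch of the truncated procedure follows. Several details are assumptions made for illustration rather than statements from the text: "beyond 2.0" is read as an absolute value above 2.0, the diagonal zeros are treated as ordinary entries, and the matrix `d_exact` is built from hypothetical scale values.

```python
def adjacent_column_separations(d, limit=2.0):
    """Mean of D'_ij - D'_i,j+1 over the k rows usable in both columns.

    d is a square matrix of normal deviates whose columns are already in
    approximate scale order. Returns the n - 1 estimated separations
    S'_{j+1} - S'_j. Raises ValueError if two adjacent columns share no rows.
    """
    n = len(d)
    separations = []
    for j in range(n - 1):
        diffs = [d[i][j] - d[i][j + 1]
                 for i in range(n)
                 if abs(d[i][j]) <= limit and abs(d[i][j + 1]) <= limit]
        if not diffs:
            raise ValueError(f"columns {j} and {j + 1} share no usable entries")
        separations.append(sum(diffs) / len(diffs))  # k = len(diffs)
    return separations

# Hypothetical error-free deviates D_ij = S_i - S_j (assumption).
true_s = [0.0, 1.0, 2.5, 3.2]
d_exact = [[si - sj for sj in true_s] for si in true_s]

seps = adjacent_column_separations(d_exact)
scale = [0.0]
for step in seps:
    scale.append(scale[-1] + step)  # accumulate separations from S'_1 = 0
print([round(s, 6) for s in scale])
```

With exact deviates the accumulated separations reproduce the hypothetical scale even though three of the pairwise differences exceed the limit and are dropped.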

It should also be remarked that this solution is not entirely a
reasonable one because we really want to check our results against
the original $p'_{ij}$. In other words, a more reasonable solution might
be one such that once the $S'_i$ are computed we can estimate the $p'_{ij}$
by $p''_{ij}$, and minimize, say,

$$
\sum (p'_{ij} - p''_{ij})^2
$$

or perhaps

$$
\sum \left( \arcsin \sqrt{p'_{ij}} - \arcsin \sqrt{p''_{ij}} \right)^2.
$$

Such a thing can no doubt be done, but the results of the author's
attempts do not seem to differ enough from the results of the present
method to be worth pursuing.
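
The two alternative criteria can be evaluated for any set of scale values with a short sketch. The function name and the example numbers are hypothetical; the reproduced proportions are taken as the unit normal area implied by equation (4), and no minimization is attempted here.

```python
from math import asin, sqrt
from statistics import NormalDist

def fit_discrepancies(p_obs, s_est):
    """Return (sum of squared p differences, sum of squared arcsine differences).

    Sums run over all pairs i != j, with p''_ij computed from the scale
    values as the unit normal area above -(S'_i - S'_j).
    """
    phi = NormalDist().cdf
    n = len(s_est)
    sq_p = 0.0
    sq_arcsin = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            p_fit = phi(s_est[i] - s_est[j])  # p''_ij reproduced from the scale
            sq_p += (p_obs[i][j] - p_fit) ** 2
            sq_arcsin += (asin(sqrt(p_obs[i][j])) - asin(sqrt(p_fit))) ** 2
    return sq_p, sq_arcsin

# Hypothetical scale values and observed proportions (assumptions).
s_demo = [0.0, 0.5, 1.2]
p_demo = [
    [0.50, 0.31, 0.12],
    [0.69, 0.50, 0.27],
    [0.88, 0.73, 0.50],
]
print(fit_discrepancies(p_demo, s_demo))
```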

THEORETICAL RELATIONSHIPS AMONG SOME
MEASURES OF CONDITIONING

By  Conrad  G.  Mueller 

Columbia  University 

Communicated  by  C.  H.  Graham,  December  10,  1949 

The relationships among the various measures of strength of conditioning
constitute an important problem for conditioning theory. Many different
measures have been used.¹ The measures latency and magnitude are based
on the occurrence of a single response, while number of responses in extinction
and the rate of responding in a "free-response" situation are based on more
than one instance of a response. Probability of response occurrence is another
term that is encountered in the literature; it is used most frequently
in cases where more than one response is possible (e.g., right and left turns
in a T maze) and in circumstances when it is possible to compute the frequency
or the percentage of times that a specified response is given. Percentage
of response occurrence is taken to be an estimate of the probability
of obtaining the response.

Some theoretical formulations are concerned with one or two measures of
strength; others are more inclusive. In only a few cases has an attempt been
made to present a theory of the relation among measures. In most treatments
that consider several measures, the relations among the measures
are empirically determined.

The  purpose  of  the  present  note  is  to  indicate  one  possible  theoretical 
account  of  the  relationships  among  latency  of  response,  rate  of  responding 
and  the  probability  of  occurrence  of  a  response.  The  last  measure  serves 
as  the  starting  point  for  the  discussion  and  provides  the  terms  in  which  the 
other  concepts  are  related. 

Consider the Skinner bar-pressing situation² in which a rat's responses
may occur at any time and at any rate during the period in which the
animal is in the experimental cage. Assume that the responses under constant
testing conditions are randomly distributed in time. Let the rate of
occurrence of these responses be represented by $r$. It may then be shown
that the probability, $P_{>t}$, of obtaining an interval between two responses
greater than $t$ is

$$
P_{>t} = e^{-rt} \tag{1}
$$

where $e$ is the base of Naperian logarithms.³ The probability of obtaining
$n$ responses in an interval, $T$, is

$$
P_n = (rT)^n e^{-rT}/n! \tag{2}
$$

Equation (1) gives us a statement of the distribution of time intervals
associated with various rates of responding. For example, for the median
time interval $P_{>t}$ is 0.5 and $-rt$ is $\log_e 0.5$, or the median $t$ is $0.69/r$.
Equation (2) gives the probability of various numbers of responses within some
specified time interval. For example, the probability of getting exactly
one response in an interval, $T$, is $(rT)e^{-rT}$. The relation between equations
(1) and (2) is obvious when we consider the probability of getting no
responses in an interval, $T$. In this case $P_0$ is $e^{-rT}$.

This article appeared in Proc. natl. Acad. Sci., 1950, 36, 123-130. Reprinted with
permission.

Equations  (1)  and  (2)  permit  us  to  transform  a  rate  measure  into  a  proba- 
bility measure.  Since  we  are  dealing  with  a  continuous  distribution 
(time),  the  probability  of  a  response  at  any  particular  time  is  zero,  but  the 
probability  of  a  response  within  given  time  intervals  is  finite  and  specifi- 
able. 
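
The conversion from a rate to probabilities can be sketched directly from equations (1) and (2). The function names are invented for this illustration, and the rate and interval length in the usage lines are arbitrary example values.

```python
from math import exp, factorial, log

def prob_interval_exceeds(r, t):
    """Equation (1): probability that an inter-response interval exceeds t."""
    return exp(-r * t)

def prob_n_responses(r, T, n):
    """Equation (2): probability of exactly n responses in an interval T."""
    return (r * T) ** n * exp(-r * T) / factorial(n)

def prob_at_least_one(r, T):
    """Probability of one or more responses in T, i.e. 1 - P_0."""
    return 1.0 - prob_n_responses(r, T, 0)

def median_interval(r):
    """Solve exp(-r t) = 0.5 for t, giving ln(2)/r (about 0.69/r)."""
    return log(2.0) / r

rate = 0.5      # example value: responses per second (assumption)
window = 4.0    # example value: interval length in seconds (assumption)
print(prob_interval_exceeds(rate, window))
print(prob_n_responses(rate, window, 1))
print(prob_at_least_one(rate, window))
print(median_interval(rate))
```

Setting `n` to zero in equation (2) reproduces equation (1), which is the relation between the two equations noted in the text.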

Latency  usually  refers  to  the  time  interval  between  a  stimulus  and  a 
response  and  thus  is  not  directly  considered  in  the  previous  development. 
Assume,  however,  that  the  stimulus  conditions  are  one  determinant  of  the 
rate  of  responding,  that  is,  that  the  rate  has  different  values  for  different 
stimulus  conditions.  This  assumption  is  consistent  with  the  discussions  by 
Skinner  and  others  who  have  emphasized  the  measurement  of  rate;  the 
assumption  would  presumably  be  an  elementary  requirement  for  any 
measure. 

Under the circumstances of the assumption, $t$ may be employed in discussing
latency, since the latter would be the time interval between the
beginning of the observation period (when a stimulus was presented) and
the first response. Thus, on the assumption that stimulus conditions are a
determinant of rate of responding and on the previous assumption that the
responses are randomly distributed in time, a statement of the rate of responding
under specified stimulus conditions implies a probability statement
of the delay of length $t$ between the presentation of the stimulus and
the occurrence of the first response. This statement tells us not only of the
distribution of latencies but also of the relationship between some representative
value, say the median latency, and the rate of responding; for
example, the probability of a latency greater than the median latency,
$t_{md}$, is 0.5; and from equation (1) we see that $-rt_{md} = \log_e 0.5$ or that
the median latency equals $0.69/r$.

The  preceding  development  does  not  imply  any  particular  theory  of  con- 
ditioning but  may  be  incorporated  into  a  large  class  of  theories.  For  ex- 
ample, if  the  foregoing  discussion  is  combined  with  a  theory  that  states 
that  rate  of  responding  is  proportional  to  the  number  of  responses  that  re- 
main to  be  given  in  extinction,  the  measure  of  number  of  responses  in 
extinction  is  immediately  related  to  our  latency  and  probability  terms.  In 
other  words,  if 

$$
r = k(N - n), \tag{3}
$$

where $N$ is the number of responses in extinction, $n$ is the number of responses
already given, and $k$ is a constant, we may substitute $k(N - n)$ for
$r$ in equation (1) and obtain

$$
P_{>t} = e^{-k(N - n)t}. \tag{4}
$$


This equation may then be examined for relationships existing among the
terms $n$, $N$, $P$ and $t$. In addition to the relationships among latency, rate
and number of responses in extinction, equation (4) may be used to predict
the distribution of responses in extinction for a constant strength and the
distribution of time intervals between responses at various stages of extinction.
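
Equations (3) and (4) can be explored numerically with a small sketch. The function names and the values of `k_demo` and `n_total` are hypothetical, chosen only to show how the median interval lengthens as extinction proceeds.

```python
from math import exp, log

def rate_in_extinction(k, N, n):
    """Equation (3): r = k (N - n), defined while responses remain (0 <= n < N)."""
    if not 0 <= n < N:
        raise ValueError("n must satisfy 0 <= n < N")
    return k * (N - n)

def prob_interval_exceeds_in_extinction(k, N, n, t):
    """Equation (4): probability that the next interval exceeds t."""
    return exp(-rate_in_extinction(k, N, n) * t)

def median_interval_in_extinction(k, N, n):
    """Median interval at stage n, from exp(-k (N - n) t) = 0.5."""
    return log(2.0) / rate_in_extinction(k, N, n)

k_demo = 0.002   # hypothetical constant (assumption)
n_total = 100    # hypothetical number of responses in extinction (assumption)
for given in (0, 50, 90):
    print(given, round(median_interval_in_extinction(k_demo, n_total, given), 2))
```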

Since the present argument follows mainly from the assumption of a
random distribution of responses in time, it is of interest to examine data
for direct evidence of randomness as well as for evidence relating to the
above outlined consequences of randomness.

FIGURE 1

The percentage of inter-response time intervals
greater than t, where t is time in seconds. The data
are from an experiment with white rats in a bar-press
situation as described in the text. The line drawn
through the data is a plot of equation (1).
Abscissa: t (TIME IN SECONDS).

The data in figure 1 were taken from measurements obtained during the
course of periodic reconditioning.⁴ The data represent the responses of a
single animal during a 20-minute session of "three-minute" periodic reconditioning.
Within this observation period the rate of responding was
approximately constant. The question at issue is whether the responses in
this interval are distributed randomly. Equation (1) states that the probability
of getting an interval between responses greater than $t$ is $e^{-rt}$,
where $r$ is the rate of responding expressed in the same units as $t$. In the
20-minute session, 238 responses were made, 237 time intervals were recorded,
and the rate in this session is 0.20 response per second. Thus,
without direct reference to the distribution of time intervals, theory specifies
the distribution of time intervals between responses uniquely. In this
case the probability of getting a time interval greater than $t$ (in seconds) is
$e^{-0.20t}$. The ordinate of figure 1 shows the percentage of the intervals between
responses that were greater than the various time values specified on
the abscissa. The solid line through the data in figure 1 represents the
theoretical function. The data are consistent with the assumption that
the measured responses occurred randomly in time.
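
The theoretical curve for this session can be reproduced from the two reported counts alone. The sketch below uses only the 238 responses and the 20-minute duration given in the text; the particular time values are chosen here for illustration and the observed percentages are not available, so only predictions are shown.

```python
from math import exp

responses = 238
session_seconds = 20 * 60

# Rate of responding in responses per second; reported in the text as 0.20.
rate = responses / session_seconds

# Predicted percentage of inter-response intervals longer than t seconds,
# from equation (1), at illustrative values of t.
predicted = {t: 100.0 * exp(-rate * t) for t in (5, 10, 15, 20, 25)}

print(round(rate, 2))
for t, pct in predicted.items():
    print(t, round(pct, 1))
```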

Although  the  data  of  figure  1  may  be  representative  of  the  agreement  be- 
tween data  and  theory  under  the  conditions  specified,  certain  cases  of  sys- 
tematic deviations  from  theory  may  be  noted.  One  class  of  deviations, 
for  example,  may  be  found  in  cases  where  animals  show  marked  "holding" 
behavior,  i.e.,  where  the  bar  is  depressed  and  held  down  for  many  seconds. 
Although the "holding" period is not a "refractory" period³ in the usual
sense  of  the  term,  it  obviously  affects  the  data  in  a  similar  way.  During 
the  "holding"  period,  the  probability  of  response  occurrence  is  zero.  One 
complicating  feature  in  analyzing  responses  characterized  by  "holding"  is 
the  fact  that  "holding"  is  of  variable  length.  The  data  available  at  present 
do  not  warrant  an  extensive  treatment  of  this  problem,  but  the  simplicity 
that  may  result  from  apparatus  changes  designed  to  eliminate  the  factor  of 
"holding"  and  the  advantages  that  may  accrue  from  the  additional  response 
specification  may  be  shown. 

An  example  of  a  distribution  showing  systematic  deviations  from  theory  is 
shown  in  figure  2.  The  computations  and  plot  are  similar  to  those  in 
figure  1.  The  ordinate  represents  the  percentage  of  intervals  between 
responses  greater  than  the  specified  abscissa  values.  The  solid  line  is 
theoretical. The constant of the line was determined, as in the case in
figure 1, directly from the rate of responding without reference to the
distribution of time intervals. The fit is obviously poor; the function appears
sigmoid  and  asymmetric. 

FIGURE 2

A plot similar to figure 1 showing the deviation from theory in cases of "holding"
behavior. The line drawn through the data is a plot of equation (1).

FIGURE 3

The data of figure 2 "corrected for holding." The plot is similar
to that in figures 1 and 2, except that the measured interval is the
time between the end of one response and the beginning of the next
response. The line drawn through the data is a theoretical one
described in the text. Abscissa: t (TIME IN SECONDS).

Let us assume that the analysis leading to equation (1) and applied to
figure 1 is correct when applied to all portions of the observation period
except the time spent in "holding." An additional test may then be
applied to the data from which figure 2 was obtained. Now we are interested
in the measurement of the time interval between the end of one response
and the beginning of the next.⁵ Figure 3 shows the results of such
measurements in the form of a plot of the percentage of intervals between
the end of one response and the beginning of the next that were greater than
the specified abscissa values. The solid line through the data is theoretical
when the rate term, $r$, is set equal to the ratio of the number of responses to
the total time minus the "holding" time, i.e., to the number of responses
per unit of "available" time. As in figures 1 and 2 the constant is evaluated
independently of the shape of the distribution of intervals.
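
The correction for "holding" amounts to a change of denominator in the rate. A minimal sketch follows; the function name and the session figures are hypothetical, since the text does not report the counts behind figures 2 and 3.

```python
from math import exp

def available_time_rate(num_responses, total_seconds, holding_seconds):
    """Responses per unit of "available" time (total time minus holding time)."""
    available = total_seconds - holding_seconds
    if available <= 0:
        raise ValueError("holding time must be less than total time")
    return num_responses / available

# Hypothetical session (assumption): 150 responses in 1200 s, 300 s spent holding.
uncorrected = 150 / 1200
corrected = available_time_rate(150, 1200, 300)

# Predicted proportion of end-to-beginning intervals longer than 5 s,
# from equation (1) with the corrected rate.
print(round(uncorrected, 4), round(corrected, 4))
print(round(exp(-corrected * 5), 4))
```

Because the corrected rate is always at least as large as the uncorrected one, the theoretical curve for the corrected intervals drops more steeply.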

Data relevant to the present analysis of latency measures are not numerous.
The agreement between the present theory and the data reported
by Felsinger, Gladstone, Yamaguchi and Hull⁶ is shown in figure 4, where
the percentage of latencies greater than specified abscissa values is plotted.
The solid line is the theoretical curve. In the case of the latency data under
consideration it is not possible to evaluate $r$ independently of the distribution
of time intervals. In the case of figure 4 the constant was determined
by the slope of a straight line fitted to a plot of $\log_e P_{>t}$ against $t$.
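
One way to obtain the constant from such a plot is an ordinary least squares line. The sketch below is an assumption about the fitting method, which the text does not specify (for instance, whether the line was forced through the origin); the data are synthetic values generated from equation (1), not the Felsinger et al. measurements.

```python
from math import exp, log

def rate_from_log_survival(times, proportions_greater):
    """Estimate r as minus the least squares slope of ln P_{>t} against t.

    proportions_greater must all be strictly positive so the logarithm exists.
    """
    ys = [log(p) for p in proportions_greater]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(ys) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, ys))
    sxx = sum((t - mean_t) ** 2 for t in times)
    return -sxy / sxx

# Synthetic example (assumption): exact exponential proportions with r = 0.8.
demo_times = [0.5, 1.0, 1.5, 2.0, 2.5]
demo_props = [exp(-0.8 * t) for t in demo_times]
r_hat = rate_from_log_survival(demo_times, demo_props)
print(round(r_hat, 6))
```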


FIGURE 4

The percentage of latencies greater than t. The data are from figure 1 of Felsinger,
Gladstone, Yamaguchi and Hull.⁶


Probably little is to be gained at this time by further sampling of the
consequences of equation (1), but many additional tests of the formulation
may be made. For some tests appropriate data are not available. For the
tests that have been tried the agreement between data and theory is
promising. One prediction that has been tested concerns the distribution
of time intervals between responses for a number of animals at comparable
stages in extinction. The expectation is that at a specified stage in extinction
the intervals between, say, response $R_n$ and $R_{n+1}$, for a large number of
animals, will be distributed in a manner similar to that shown in figure 1
and that the constant, $r$ (therefore the steepness of drop of the curve) will
vary systematically with $n$. In other words, the steepness of the drop of a
curve such as found in figure 1 will depend on where in extinction the intervals
are measured. In fact this expectation seems to be borne out by the
cases measured, although the number of measurements at each stage of
extinction is not large.

Finally,  it  may  be  pointed  out  that  the  form  of  the  present  account  has 
important  consequences  for  the  treatment  of  experimental  data.  Since 
one  of  the  features  of  the  account  is  the  possibility  of  specifying  the  fre- 
quency distributions  of  the  measures  discussed  it  is  possible  to  eliminate 
many  of  the  problems  associated  with  the  arbitrary  selection  of  representa- 
tive values  in  summarizing  data.  On  the  basis  of  the  preceding  equations, 
one  may  state  changes  in  one  statistic,  say  the  arithmetic  mean,  in  terms  of 
changes  in  another,  say  the  geometric  mean  or  the  median.  Therefore, 
data  using  different  statistics  are  made  comparable  and  the  multiplicity  of 
functions  that  may  arise  from  the  use  of  different  descriptive  statistics  not 
only  ceases  to  pose  a  difficult  problem  but  is  actually  an  aid  to  theory  test- 
ing. 
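
As an illustration of how one statistic can be stated in terms of another, the following worked relations hold if the intervals follow the exponential distribution implied by equation (1). The symbol $\gamma$ (Euler's constant, about 0.5772) is introduced here and is not used in the original.

```latex
\begin{aligned}
\text{arithmetic mean} &= \int_0^{\infty} e^{-rt}\,dt = \frac{1}{r}, \\
\text{median} &= \frac{\log_e 2}{r} \approx \frac{0.69}{r}, \\
\text{geometric mean} &= \exp\!\left( \int_0^{\infty} (\log_e t)\, r e^{-rt}\,dt \right)
                       = \frac{e^{-\gamma}}{r} \approx \frac{0.56}{r}.
\end{aligned}
```

Under this assumption each statistic is a fixed multiple of $1/r$, so a change in one determines the change in the others.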

Summary. — A  theoretical  account  of  some  relationships  among  measures 
of  strength  of  conditioning  has  been  considered.  (1)  If  we  assume  that 
responses  in  a  "free-response"  situation  are  randomly  distributed  in  time, 
we  obtain  directly  a  statement  of  the  probability  of  occurrence  of  a  re- 
sponse (or  of  any  number  of  responses)  within  a  specified  time  interval  as  a 
function of the length of the interval and of the rate of responding; we also
obtain  a  statement  of  the  probability  of  occurrence  of  inter-response  time 
intervals  of  varying  lengths.  (2)  If  we  assume  that,  for  any  specified 
stimulus  condition,  there  corresponds  some  rate  of  responding,  it  turns  out 
that  the  probability  of  occurrence  of  latencies  of  various  lengths  may  be 
specified  for  various  rates  of  responding,  or,  for  a  fixed  probability  value, 
the  relation  between  latency  and  rate  may  be  specified.  (3)  Finally, 
where  these  considerations  are  added  to  a  theory  specifying  the  relationship 
between  rate  of  responding  and  number  of  responses  yet  to  occur,  the 
number  of  responses  in  extinction  may  be  related  to  the  latency  and  proba- 
bility terms  as  well  as  to  rate.  In  addition  to  statements  about  average 
values,  the  present  formulation  has  consequences  for  the  distribution  of 
time  intervals  between  responses  and,  by  extension,  for  the  distribution  of 
latency  measures. 

¹ Hull, C. L., Principles of Behavior, D. Appleton-Century Co., New York, 1943.

² Skinner, B. F., The Behavior of Organisms, D. Appleton-Century Co., New York,
1938.

³ A slightly different equation results if we assume that a "refractory" period exists,
i.e., that immediately after a response there is a period during which the probability of
getting a response is zero. If we assume that the transition from the "refractory"
period to randomness is instantaneous, the probability of getting an interval greater than
$t$ is

$$
P_{>t} = e^{-r(t - t_0)}
$$

where $t_0$ is the "refractory" period. The formulation is more complex if the transition is
treated as a gradual one or if the "refractory" period has a variable length.

⁴ The data reported here were recorded by Mr. Michael Kaplan in the Psychological
Laboratories of Columbia University.

⁵ This is merely a first approximation. Subsequent analyses may show that the
interval between the end of one response and the beginning of the next is not independent
of the "holding" period. The results of our procedure indicate that the approximation is
useful for the present.

⁶ The experiment by Felsinger, Gladstone, Yamaguchi and Hull [J. Exptl. Psychol.,
37, 214-228 (1947)] may not provide an optimal test of our formulation for two reasons.
The  first  is  that  the  data  are  reported  in  a  frequency  distribution  with  step  intervals 
which  begin  at  zero.  If  the  shortest  latency  were  greater  than  zero,  starting  the  step 
intervals  at  the  lowest  measure  would  be  more  appropriate.  The  use  of  zero  as  a  lower 
limit  could  easily  make  an  exponential  distribution  more  normal.  The  method  of  sum- 
marizing the  data  may  account  for  the  deviation  of  the  point  at  0.5  second  in  figure  4. 
The  deviation  of  this  point  is  an  expression  of  the  fact  that  the  distribution  reported  by 
Felsinger,  Gladstone,  Yamaguchi  and  Hull  does  not  have  a  maximum  frequency  at  the 
first  step  interval. 

In the second place, it may be assumed that the many transient discriminative stimuli
associated  with  the  exposure  of  the  bar  may  play  a  more  important  role  than  the  con- 
tinuous ones  associated  with  the  presence  of  the  bar.  Although  it  is  possible  to  extend 
the  present  notion  to  stimuli  of  short  duration  which  end  before  the  occurrence  of  the  re- 
sponse, additional  assumptions  are  required.  A  less  equivocal  test  of  the  present  theory 
may be expected from a distribution of latencies obtained from an experimental
procedure of the sort used by Skinner (op. cit.), Frick [J. Psychol., 26, 96-123 (1948)] and
others.  After  a  period  of,  say,  no  light,  a  light  is  presented  and  stays  on  until  one  re- 
sponse occurs  (Skinner)  or  stays  on  for  some  fixed  period  of  time  sufficiently  long  to  in- 
sure the  occurrence  of  many  responses  (Frick).  Such  experimental  procedures  would 
minimize  unspecified  transient  stimuli  and  would  parallel  more  closely  the  notion 
that  stimulus  conditions  determine  a  rate  of  responding.  The  procedure  used  by 
Frick  has  the  additional  advantage  of  permitting  the  measurement  of  the  time  interval 
between  the  onset  of  the  stimulus  and  the  first  response  and  the  subsequent  intervals 
between responses under "the same" stimulus conditions.

The problem of signal detectability treated in this paper is the following:
Suppose  an  observer  is  given  a  voltage  varying  with  time  during  a  prescribed  obser- 
vation interval  and  is  asked  to  decide  whether  its  source  is  noise  or  is  signal  plus  noise. 
What  method  should  the  observer  use  to  make  this  decision,  and  what  receiver  is  a 
realization  of  that  method?  After  giving  a  discussion  of  theoretical  aspects  of  this  prob- 
lem, the  paper  presents  specific  derivations  of  the  optimum  receiver  for  a  number  of 
cases  of  practical  interest. 

The  receiver  whose  output  is  the  value  of  the  likelihood  ratio  of  the  input  volt- 
age over  the  observation  interval  is  the  answer  to  the  second  question  no  matter  which 
of  the  various  optimum  methods  current  in  the  literature  is  employed  including  the  Ney- 
man-Pearson  observer,  Siegert's  ideal  observer,  and  Woodward  and  Davies'  "observer." 
An  optimum  observer  required  to  give  a  yes  or  no  answer  simply  chooses  an  operating 
level  and  concludes  that  the  receiver  input  arose  from  signal  plus  noise  only  when  this 
level  is  exceeded  by  the  output  of  his  likelihood  ratio  receiver. 

Associated  with  each  such  operating  level  are  conditional  probabilities  that  the 
answer  is  a  false  alarm  and  the  conditional  probability  of  detection.  Graphs  of  these 
quantities,  called  receiver  operating  characteristic,  or  ROC,  curves  are  convenient  for 
evaluating  a  receiver.  If  the  detection  problem  is  changed  by  varying,  for  example,  the 
signal  power,  then  a  family  of  ROC  curves  is  generated.  Such  things  as  betting  curves 
can  easily  be  obtained  from  such  a  family.  The  operating  level  to  be  used  in  a  particu- 
lar situation  must  be  chosen  by  the  observer.  His  choice  will  depend  on  such  factors 
as  the  permissible  false  alarm  rate,  a  priori  probabilities,  and  relative  importance  of 
errors. 

With  these  theoretical  aspects  serving  as  an  introduction,  attention  is  devoted 
to  the  derivation  of  explicit  formulas  for  likelihood  ratio,  and  for  probability  of  detec- 
tion and  probability  of  false  alarm,  for  a  number  of  particular  cases.  Stationary,  band- 
limited,  white  Gaussian  noise  is  assumed.  The  seven  special  cases  which  are  presented 
were  chosen  from  the  simplest  problems  in  signal  detection  which  closely  represent 
practical  situations. 

Two  of  the  cases  form  a  basis  for  the  best  available  approximation  to  the  impor- 
tant problem  of  finding  probability  of  detection  when  the  starting  time  of  the  signal, 
signal  frequency,  or  both,  are  unknown.  Furthermore,  in  these  two  cases  uncertainty  in 
the  signal  can  be  varied,  and  a  quantitative  relationship  between  uncertainty  and 
ability  to  detect  signals  is  presented  for  these  two  rather  general  cases.  The  variety  of 
examples  presented  should  serve  to  suggest  methods  for  attacking  other  simple  signal 
detection  problems  and  to  give  insight  into  problems  too  complicated  to  allow  a  direct 
solution. 

1.  Introduction 

The  problem  of  signal  detectability  treated  in  this  paper  is  that  of  determining  a 
set  of  optimum  instructions  to  be  issued  to  an  "observer"  who  is  given  a  voltage 
varying  with  time  during  a  prescribed  observation  interval  and  who  must  judge  whether 
its  source  is  "noise"  or  "signal  plus  noise."  The  nature  of  the  "noise"  and  of  the 
"signal  plus  noise"  must  be  known  to  some  extent  by  the  observer. 

From  Trans.  IRE  Professional  Group  in  Information  Theory,  1954,  PGIT  2-4,  171-212. 
Reprinted  with  permission. 

*  The  work  reported  in  this  paper  was  done  under  U.S.  Army  Signal  Corps  Contract 
No. DA-36-039SC-15358.

Any equipment which the observer uses to make this judgment is called the
"receiver."  Therefore  the  voltage  with  which  the  observer  is  presented  is  called  the 
"receiver  input."  The  optimum  instructions  may  consist  primarily  in  specifying 
the  "receiver"  to  be  used  by  the  observer. 

The  first  three  sections  of  this  article  survey  the  applications  of  statistical  methods 
to  this  problem  of  signal  detectability.  They  are  intended  to  serve  as  an  introduction 
to  the  subject  for  those  who  possess  a  minimum  of  mathematical  training.  Several 
definitions  of  "optimum"  instructions  have  been  proposed  by  other  authors.  Emphasis 
is  placed  here  on  the  fact  that  these  various  definitions  lead  to  essentially  the  same 
receiver.  In  subsequent  sections  the  actual  specification  of  the  optimum  receiver  is 
carried  out  and  its  performance  is  evaluated  numerically  for  some  cases  of  practical 
interest  [17]. 

1.1  Population  SN  and  N 

Either noise alone or the signal plus noise may be capable of producing many
different receiver inputs. The totality of all possible receiver inputs when noise alone
is present is called "Population N"; similarly, the collection of all receiver inputs when
signal plus noise is present is called "Population SN." The observer is presented with a
receiver  input  from  one  of  the  two  populations,  but  he  does  not  know  from  which 
population  it  came;  indeed,  he  may  not  even  know  the  probability  that  it  arose  from  a 
particular  population.  The  observer  must  judge  from  which  population  the  receiver 
input  came. 

1.2  Sampling  plans 

A  sampling  plan  is  a  system  of  making  a  sequence  of  measurements  on  the 
receiver  input  during  the  observation  interval  in  such  a  way  that  it  is  possible  to  re- 
construct the  receiver  input  for  the  observation  interval  from  the  measurements. 
Mathematically,  a  sampling  plan  is  a  way  of  representing  functions  of  time  as  sequences 
of  numbers.  The  simplest  way  to  describe  this  idea  is  to  list  a  few  examples. 

A: Fourier series on an interval. Suppose that the observation interval begins
at time $t_0$ and is $T$ seconds long, and that each function in the populations SN and N can
be expanded in a Fourier series on the observation interval. The Fourier coefficients
for each particular receiver input can be obtained by making measurements on that
input, which can in turn be reconstructed from these measurements by the formula

$$x(t) = a_0 + \sum_{n=1}^{\infty} \left( a_n \cos\frac{2\pi n t}{T} + b_n \sin\frac{2\pi n t}{T} \right), \qquad t_0 < t < t_0 + T. \tag{1}$$

Thus the process of representing each function $x(t)$ by the sequence of its Fourier
coefficients $(a_0, a_1, b_1, \ldots, a_n, b_n, \ldots)$ is a sampling plan in the sense described above.
The pair of terms in the Fourier series which involve the cosine and sine of
$2\pi n t/T$ is of frequency $n/T$ cycles per second. Suppose that for a particular population
of receiver inputs the terms of frequency greater than $n/T$ are zero; i.e., the population
is bandlimited in the Fourier series sense or simply "series-bandlimited." For such a
population the process of representing each receiver input $x(t)$ by the finite sequence
$(a_0, a_1, b_1, \ldots, a_n, b_n)$ is a finite sampling plan.*

*  A  sampling  plan  is  finite  if  there  is  a  finite  maximum  length  for  the  sequences  for  all 
receiver inputs in the population.

B: Shannon's sampling plan. Suppose that the observation interval includes
all time and that the populations are "transform-bandlimited" to a band from 0
to $W$ cycles per second, i.e., the Fourier transform of every receiver input is zero
for frequencies greater than $W$. A sampling plan for this population is to represent
each function $x(t)$ by its amplitude measured at times spaced $1/2W$ seconds apart,
$(\ldots, x(t_0 - n/2W), \ldots, x(t_0 - 1/2W), x(t_0), x(t_0 + 1/2W), \ldots, x(t_0 + n/2W), \ldots)$. In
this case the formula [2] for the reconstruction of the receiver input is

$$x(t) = \sum_{n=-\infty}^{\infty} x\!\left(t_0 + \frac{n}{2W}\right) \frac{\sin \pi[2W(t - t_0) - n]}{\pi[2W(t - t_0) - n]}. \tag{2}$$


The instants of time $t_0 + n/2W$ are called sampling-times. Each choice of $t_0$ between
0 and $1/2W$ yields a different sampling plan. If the observation interval again includes
all time, but the populations are transform-bandlimited to a frequency band from
$f_0 - W/2$ to $f_0 + W/2$ which does not contain zero frequency, then each receiver input
$x(t)$ can be considered as an amplitude and frequency modulated waveform,
$x(t) = r(t) \cos(2\pi f_0 t + \theta(t))$; $r(t)$ is the amplitude of the envelope and $\theta(t)$ is the
instantaneous phase of the carrier. A sampling plan employing sampling-times is obtained in this
case by representing each receiver input by the sequence
$(\ldots, r(t_0), \theta(t_0), \ldots, r(t_0 + n/W), \theta(t_0 + n/W), \ldots)$
of envelope amplitudes and carrier phases measured at sampling-times
spaced $1/W$ seconds apart [1]. The reconstruction of the receiver input from
this sequence is given by

$$x(t) = \sum_{n=-\infty}^{\infty} r\!\left(t_0 + \frac{n}{W}\right) \cos\!\left[2\pi f_0 t + \theta\!\left(t_0 + \frac{n}{W}\right)\right] \frac{\sin \pi[W(t - t_0) - n]}{\pi[W(t - t_0) - n]}. \tag{3}$$
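
As an illustration of reconstruction formula (2), the following Python sketch sums a truncated version of the series. The function names, the truncation bound `n_max`, and the 1-cycle-per-second test cosine are assumptions made for this example and do not come from the paper; because the infinite sum is cut off, a small residual error remains.

```python
import math


def sinc(u: float) -> float:
    """Normalized sinc, sin(pi*u) / (pi*u), with the limiting value 1 at u = 0."""
    if u == 0.0:
        return 1.0
    return math.sin(math.pi * u) / (math.pi * u)


def reconstruct(x, t: float, W: float, t0: float = 0.0, n_max: int = 2000) -> float:
    """Truncated form of Eq. (2): sum over n = -n_max..n_max of
    x(t0 + n/2W) * sin(pi[2W(t - t0) - n]) / (pi[2W(t - t0) - n])."""
    total = 0.0
    for n in range(-n_max, n_max + 1):
        sample = x(t0 + n / (2.0 * W))  # amplitude at the n-th sampling-time
        total += sample * sinc(2.0 * W * (t - t0) - n)
    return total


# Hypothetical receiver input: a cosine at 1 cycle per second,
# well inside an assumed band limit of W = 4 cycles per second.
W = 4.0


def signal(t: float) -> float:
    return math.cos(2.0 * math.pi * 1.0 * t)
```

Evaluating `reconstruct(signal, 0.03, W)` and comparing it with `signal(0.03)` gives a feel for how quickly the truncated series approaches the original waveform; a different initial sampling-time can be tried through `t0`.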


C:  Sampling  plan  using  sampling-times  for  a  finite  observation  interval.  Only 
functions  known  for  all  times  have  Fourier  transforms,  and  therefore  the  hypothesis 
that  the  populations  are  transform-bandlimited  applies  only  when  the  observation 
interval  includes  all  time.  If  the  observation  interval  is  of  finite  length  and  if  the 
populations  are  series-bandlimited,  then  there  are  sampling  plans  utilizing  sampling- 
times  which  are  similar  to  those  described  in  paragraph  B  for  transform-bandlimited 
populations  and  an  infinite  observation  interval.  Suppose  that  time  is  measured  from 
the  beginning  of  the  observation  interval,  which  is  T seconds  long,  and  suppose  that  the 
populations are series-bandlimited from 0 to $W$ cycles per second. A finite sampling
plan for this situation can be obtained by representing each receiver input by the
sequence of its amplitudes measured $1/2W$ seconds apart [1]

$$\left[ x(t_0),\; x\!\left(t_0 + \frac{1}{2W}\right),\; \ldots,\; x\!\left(t_0 + T - \frac{1}{2W}\right) \right] \tag{4}$$

and the reconstruction of the receiver input from this sequence is

$$x(t) = \sum_{n=0}^{2WT-1} x\!\left(t_0 + \frac{n}{2W}\right) \frac{\sin \pi[2W(t - t_0) - n]}{2WT \sin\left\{\dfrac{\pi[2W(t - t_0) - n]}{2WT}\right\}}, \qquad 0 < t < T. \tag{5}$$


Again each choice of the (initial) sampling-time $t_0$ between 0 and $1/2W$ yields a different
sampling plan. In a similar fashion, if the observation interval is unchanged but the
populations are series-bandlimited on this interval to a frequency band from $f_0 - W/2$
to $f_0 + W/2$ which does not include zero frequency, then each receiver input can be
represented by a finite sequence
$[r(t_0), \theta(t_0), r(t_0 + 1/W), \theta(t_0 + 1/W), \ldots, r(t_0 + T - 1/W), \theta(t_0 + T - 1/W)]$
of envelope amplitudes and carrier phases measured at sample
points $1/W$ seconds apart; $t_0$ is again used to denote the initial sampling time which
may be chosen anywhere from 0 to $1/W$. The reconstruction of the receiver input from
this sequence of measurements is given by

$$x(t) = \sum_{n=0}^{WT-1} r\!\left(t_0 + \frac{n}{W}\right) \cos\!\left[2\pi f_0 t + \theta\!\left(t_0 + \frac{n}{W}\right)\right] \frac{\sin \pi[W(t - t_0) - n]}{WT \sin\left\{\dfrac{\pi[W(t - t_0) - n]}{WT}\right\}}, \qquad 0 < t < T. \tag{6}$$


From  these  examples  it  can  be  seen  that  there  are  a  number  of  important  dif- 
ferences between  various  sampling  plans  such  as  (a)  the  length  of  the  observation 
interval,  (b)  whether  sampling-times  are  employed,  and  (c)  whether  the  measurements 
are  all  to  be  of  the  same  kind,  e.g.,  instantaneous  amplitude  measurements,  or  of  dif- 
ferent kinds,  e.g.,  envelope  amplitude  and  carrier  phase.  However,  they  all  have  in 
common  the  property  that  the  receiver  input  can  be  reconstructed  from  the  measure- 
ments made  on  it. 

The  role  which  the  sampling  plan  plays  in  the  theory  presented  in  this  paper  is 
primarily  one  of  mathematical  convenience.  The  populations  N  and  SN  will  be 
represented  as  sequences  through  the  use  of  sampling  plans  in  order  to  apply  statistical 
methods.  Once  an  answer  is  obtained  concerning  an  "optimum"  receiver,  it  is  often 
possible  to  translate  this  answer  back  to  the  more  familiar  language  of  receiver  inputs. 
If  a  finite-sampling  plan  is  not  available  for  a  particular  application  of  the  theory,  then 
recent  work  by  Grenander  [3]  shows  that  the  desired  parameters  of  the  "optimum" 
receiver  can  be  approximated  by  using  finite-sampling  plans.  Both  for  this  reason  and 
in order to simplify the exposition, the theory presented here is restricted to cases where
finite-sampling  plans  are  available. 

2.  Optimum  Tests  on  Fixed  Observation  Intervals 

2.1  Probability  density  functions 

This  part  of  the  paper  is  concerned  with  a  method  of  statistical  analysis  which 
requires for raw data a finite sequence of numbers $(x_1, x_2, \ldots, x_n)$, which is the result
of the measurements made at the receiver input according to some particular finite-sampling
plan. The sequence is often called a "sample" of the population from which
it arose, and is denoted by a single letter; thus, if the receiver input is $x(t)$, and the
sampling plan yields a sequence $(x_1, x_2, \ldots, x_n)$, then this sequence is called the sample
$X$. The theory to be developed here is intended to specify an optimum receiver and is
couched in the language of samples, $X = (x_1, x_2, \ldots, x_n)$. If $n$ is very large, a receiver
which  had  to  make  the  measurements  called  for  by  a  sampling  plan  would  certainly 
be  impractical.  However,  this  practical  difficulty  is  avoided  when  the  specification  of 
the  receiver  is  translated  back  from  the  language  of  samples  to  the  language  of  the 
receiver  inputs;  this  can  be  done  because  it  is  possible  to  reconstruct  the  inputs  from 
the samples.

For the purposes of the subsequent development any finite sampling plan may
be considered, provided enough properties are known of the associated sample $X$ so
that certain probabilities may be calculated. Specifically, the probability density functions
$f_N(X)$ and $f_{SN}(X)$ of the sample variable $X$ for the cases when $X$ is drawn from
populations N and SN, respectively, must be known.* The two basic properties of
density functions are

$$f_N(X) \ge 0, \quad \int f_N(X)\,dX = 1, \qquad \text{and} \qquad f_{SN}(X) \ge 0, \quad \int f_{SN}(X)\,dX = 1, \tag{7}$$

where the integration symbol represents the multiple integral taken over the entire
range of the sample variable $X = (x_1, x_2, \ldots, x_n)$.

2.2  The  concept  of  a  criterion 

Consider now an observer who has as available data the sample $X = (x_1, \ldots, x_n)$.
The  observer's  job  is  to  judge  for  each  sample  whether  or  not  it  was  taken  from  popula- 
tion SN.  Although  it  is  not  possible  to  determine  the  (probably  subconscious)  criterion 
used  by  the  observer,  it  is  quite  possible  to  find  an  external  manifestation  of  it.  Ideally 
all  that  is  necessary  is  to  submit  each  possible  sample  to  the  observer  and  to  record  his 
judgment.  This  will  yield  a  tabulation  of  those  samples  which  the  observer  decided 
were  drawn  from  population  SN.  If  any  other  observer  is  given  this  tabulation  and 
instructed  to  base  his  decisions  on  it,  he  will  behave  exactly  as  did  the  first  observer. 
Thus,  the  tabulation  of  these  responses  can  be  used  to  replace  the  mental  criterion 
employed  by  the  observer.  Such  a  tabulation  will  also  be  called  a  criterion  and  will  be 
denoted  by  the  letter  A,  which  refers  to  the  phraseology  common  in  statistics  of 
"Accepting  the  hypothesis  that  a  signal  is  present."  The  tabulation  of  the  remaining 
samples, those which the observer concluded were drawn from population N, will be
denoted  by  B. 

2.3  Probabilities  associated  with  criteria 

There are, of course, as many different criteria as there are observers. Among
all possible criteria it is necessary to select those that are best for various purposes. To
do so, certain numerical quantities must be associated with each criterion. It will be
necessary to know the probability that a sample from one of the populations will be
listed in a particular criterion $A$. According to the standard definitions, these probabilities
are given by

$$P_N(A) = \int_A f_N(X)\,dX \qquad \text{and} \qquad P_{SN}(A) = \int_A f_{SN}(X)\,dX, \tag{8}$$

where the multiple integral is taken over all samples listed in the criterion $A$.

*  In  this  discussion  it  should  be  kept  in  mind  that  "the  event  of  the  sample  being  drawn 
from  population  SN"  corresponds  to  signal  and  noise  being  present  at  the  receiver  input. 
Also "the event of population SN being sampled" means the same thing.

For example, a particular sampling plan might have a density function of the form
$f_N(x_1, x_2, \ldots, x_n) = K \exp[-(x_1^2 + x_2^2 + \cdots + x_n^2)]$. A possible criterion would
consist of those samples $X = (x_1, x_2, \ldots, x_n)$ which lie outside a sphere of radius 1
centered at the origin. Then the integral would be taken over the exterior of this sphere.
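
The integral in this example can be estimated numerically. The sketch below is a Monte Carlo illustration under two assumptions that are not stated in the text: the exponent is read as a sum of squares, and $K$ is the normalizing constant, so that each coordinate is an independent normal variable with mean 0 and variance 1/2. The function name, the number of draws, and the seed are hypothetical choices for the illustration.

```python
import math
import random


def false_alarm_probability(n: int, radius: float = 1.0,
                            draws: int = 200_000, seed: int = 0) -> float:
    """Estimate P_N(A) when A lists the samples outside a sphere of the given
    radius and f_N is proportional to exp[-(x_1^2 + ... + x_n^2)]."""
    rng = random.Random(seed)
    sigma = math.sqrt(0.5)  # exp(-x^2) is a normal density with variance 1/2
    r2 = radius * radius
    listed = 0
    for _ in range(draws):
        squared_norm = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n))
        if squared_norm > r2:  # the sample lies outside the sphere, so it is in A
            listed += 1
    return listed / draws
```

For $n = 2$ the integral can also be done in polar coordinates and equals $e^{-1}$, which gives a convenient check on the estimate returned by `false_alarm_probability(2)`.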

These probabilities have a special significance. $P_N(A)$ is the conditional probability
that a sample from population N will be listed in criterion $A$; that is, will be
judged as a sample from population SN. Thus $P_N(A) = F$ is the conditional false
alarm probability. Also, $P_{SN}(A)$ is the conditional probability of a certain kind of
correct response called a hit (that of judging correctly that a sample is from population
SN). The conditional probability of judging falsely that a sample is from population
N is, therefore, given by $1 - P_{SN}(A) = M$, the conditional probability of a miss.
The only errors which can occur are false alarms and misses; their conditional probabilities,
$F$ and $M$, are called briefly the error probabilities.

A  reader  familiar  with  the  formal  content  of  probability  theory  should  note  that 
these  quantities  are  true  conditional  probabilities ;  the  first  is  conditional  on  the  sample 
being  drawn  from  population  SN;  the  second  is  conditional  on  its  being  drawn  from 
population  N.  This  is  to  distinguish  them  from  a  priori  probabilities  (the  probabilities 
that  a  certain  population  will  be  sampled,  for  example)  which  are  not  as  yet  assumed 
known. 

2.4  Likelihood  ratio  and  the  ratio  criteria 

It is convenient to introduce a new function called the likelihood ratio, $l(X)$,
defined as the ratio $f_{SN}(X)/f_N(X)$ for sample points $X = (x_1, \ldots, x_n)$; $l(X)$ represents
the likelihood that the sample $X$ was drawn from SN relative to the likelihood that it
was drawn from N. Hence, if $l(X)$ is sufficiently large, it would be reasonable to conclude
that $X$ was in fact drawn from population SN, i.e., that $X$ should be listed in the desired
"best" criterion. Thus, for each number $\beta \ge 0$, a certain criterion $A(\beta)$ will be selected;
$A(\beta)$ is chosen by listing each sample $X$ for which $l(X) \ge \beta$. The problem then reduces
to that of making a wise choice of $\beta$; that is, to determine how large "sufficiently large"
is. Criteria of the form $A(\beta)$ will be called ratio criteria.
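
A short Python sketch may make the ratio criterion concrete. It assumes a model chosen only for illustration and not taken from this section: the $n$ measurements are independent, normal with mean 0 and variance $\sigma^2$ under N, and normal with known means $s_i$ and the same variance under SN. The names `likelihood_ratio` and `in_ratio_criterion` are hypothetical.

```python
import math


def likelihood_ratio(x, signal, sigma: float = 1.0) -> float:
    """l(X) = f_SN(X) / f_N(X) for the assumed Gaussian model.

    The normalizing constants cancel, leaving
    log l(X) = sum_i s_i * (x_i - s_i / 2) / sigma^2.
    """
    log_l = sum(s * (xi - 0.5 * s) for xi, s in zip(x, signal)) / sigma ** 2
    return math.exp(log_l)


def in_ratio_criterion(x, signal, beta: float, sigma: float = 1.0) -> bool:
    """True when the sample X is listed in A(beta), i.e. when l(X) >= beta."""
    return likelihood_ratio(x, signal, sigma) >= beta
```

Raising `beta` shrinks the criterion: fewer samples are reported as coming from SN, so both hits and false alarms become rarer.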

A  number  of  writers  have  presented  varying  definitions  of  a  criterion  being 
"optimum."  It  turns  out  that  each  of  these  optimum  criteria  can  be  expressed  as  a  ratio 
criterion,  so  that  a  receiver  designed  to  yield  likelihood  ratio  as  output  could  be  used 
with  any  of  them. 

2.5  Weighted  combination  criteria 

Suppose it is possible to assign a certain number $w$ as a weighting factor representing
the importance of a false alarm relative to a hit. Since $P_{SN}(A)$ is the probability
of a hit, and $P_N(A)$ the probability of a false alarm, it would then be reasonable
to find a criterion $A$ which maximizes the quantity

$$P_{SN}(A) - w P_N(A). \tag{9}$$

But this quantity can be written as

$$\int_A [f_{SN}(X) - w f_N(X)]\,dX, \tag{10}$$

where the integration is taken over the sample points $X$ listed in $A$. To maximize this
integral, one would list in $A$ every sample for which the integrand was not negative.
Solving that inequality for $w$, one sees that $A$ should contain those sample points $X$ for
which

$$l(X) = \frac{f_{SN}(X)}{f_N(X)} \ge w. \tag{11}$$

Thus the desired criterion $A$ is simply $A(w)$, and so it is a ratio criterion.
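
The argument can be checked by brute force on a small discrete analogue, in which integrals become sums. The four-point sample space and its probabilities below are invented for the illustration; the sketch compares the ratio criterion with every one of the 16 possible criteria.

```python
from itertools import combinations

# Hypothetical discrete sample space with four possible samples.
f_N = {"a": 0.50, "b": 0.30, "c": 0.15, "d": 0.05}   # probabilities under noise alone
f_SN = {"a": 0.10, "b": 0.20, "c": 0.30, "d": 0.40}  # probabilities under signal plus noise


def weighted_value(A, w):
    """P_SN(A) - w * P_N(A), expression (9), for a criterion A given as a set."""
    return sum(f_SN[x] for x in A) - w * sum(f_N[x] for x in A)


def ratio_criterion(w):
    """A(w): every sample whose likelihood ratio f_SN / f_N is at least w."""
    return {x for x in f_N if f_SN[x] / f_N[x] >= w}


def best_by_enumeration(w):
    """Largest value of expression (9) over all subsets of the sample space."""
    samples = list(f_N)
    return max(weighted_value(set(c), w)
               for r in range(len(samples) + 1)
               for c in combinations(samples, r))
```

For any weighting factor tried, `weighted_value(ratio_criterion(w), w)` should agree with `best_by_enumeration(w)`, which is the discrete counterpart of the statement that the maximizing criterion is $A(w)$.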

2.6  Neyman-Pearson  criteria 

If it is critically important to keep the probability of a false alarm $P_N(A)$ below
a certain level $k$, then it would be reasonable to choose from among such criteria that
one which maximizes the probability of a hit. Thus Neyman and Pearson proposed
[4] as a type of optimum criterion any criterion $A_k$ for which

(1) $P_N(A_k) \le k$, and

(2) $P_{SN}(A_k)$ is a maximum for all the criteria $A$ with the property $P_N(A) \le k$.

The $A_k$ type criterion can also be expressed as a ratio criterion. This can be
made plausible as follows. To begin with, it is necessary to consider only those criteria
$A$ for which $P_N(A) = k$, because $A$ will be taken as large as possible in order to meet
condition (2). Now consider the curve given parametrically by the equations

$$X = X(\beta) = P_N[A(\beta)] \qquad \text{and} \qquad Y = Y(\beta) = P_{SN}[A(\beta)]. \tag{12}$$

This  curve  will  be  called  the  Receiver  Operating  Characteristic  (briefly,  ROC)  curve, 
for  a  receiver  whose  output  is  likelihood  ratio  and  with  which  ratio  criteria  are  being 
used. 

The ROC curve passes through the points (0, 0) and (1, 1), the first at $\beta = \infty$,
the second at $\beta = 0$. At $\beta = 0$, $l(X) \ge \beta = 0$ for all $X$, so $A(0)$ consists of all possible
samples. Thus the observer will report that every sample is drawn from SN, so he will
be certain to make a false alarm and to make a hit. (This assumes that the samples will
not be drawn exclusively from one of the populations.) This can be verified, using the
basic property of the density functions expressed by the following equations:

$$P_{SN}[A(0)] = \int f_{SN}(X)\,dX = 1 \qquad \text{and} \qquad P_N[A(0)] = \int f_N(X)\,dX = 1, \tag{13}$$

where the integration is taken over all possible samples $X$. These equations mean that
$X(0) = Y(0) = 1$. Moreover, $X(\infty) = Y(\infty) = 0$, because for $\beta = \infty$ there are no
samples $X$ with $l(X) \ge \infty$; i.e., $A(\infty)$ contains no samples at all and the operator
will never report a signal is present. Therefore, the operator cannot possibly make a
false alarm nor can he make a hit. Thus $P_{SN}[A(\infty)] = 0$ and $P_N[A(\infty)] = 0$.

These  considerations,  together  with  those  of  the  next  section,  show  that  the 
ROC curve can be sketched somewhat as in Fig. 1.

Figure 1
Typical ROC curve. [Plot not reproduced. Horizontal axis: $X(\beta) = P_N[A(\beta)]$; the marked point on the curve is $Q = (X, Y)$.]

To determine the desired $A_k$, recall that all probabilities lie between zero and
one, so that $P_N(A_k) = k$ is between zero and one. Then there is a point $Q$ of the ROC
curve which lies vertically above the point $(k, 0)$. The coordinates $(X, Y)$ of $Q$ are
$X = P_N[A(\beta)] = k$ and $Y = P_{SN}[A(\beta)]$ for some $\beta$, which will be written $\beta_k$. Now
$A(\beta_k)$ satisfies condition (1) because $P_N[A(\beta_k)] = k$, and therefore $A(\beta_k)$ will be the
desired $A_k$ if $P_{SN}(A) \le P_{SN}[A(\beta_k)]$ for any criterion with the property that $P_N(A) = k$.
From paragraph 2.5, it is clear that the ratio criterion $A(\beta_k)$ is an optimum weighted-combination
criterion with the weighting factor $w = \beta_k$. Therefore, if $w = \beta_k$, the
weighted combination using the criterion $A(\beta_k)$ is greater than or equal to the same
weighted combination using any other criterion $A$, i.e.,

$$P_{SN}[A(\beta_k)] - \beta_k P_N[A(\beta_k)] \ge P_{SN}(A) - \beta_k P_N(A). \tag{14}$$

In this case both $P_N[A(\beta_k)]$ and $P_N(A)$ are equal to $k$. If this value is substituted into
the inequality above, one obtains

$$P_{SN}[A(\beta_k)] \ge P_{SN}(A). \tag{15}$$

Therefore, the desired Neyman-Pearson criterion $A_k$ should be chosen to be this particular
ratio criterion, $A(\beta_k)$.
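
As a numerical illustration of choosing $\beta_k$, the sketch below works through a one-measurement case under an assumed model that is not part of this section: the single sample $x$ is normal with mean 0 and unit variance under N, and normal with mean $d > 0$ and unit variance under SN. Then $l(x) = \exp(dx - d^2/2)$ increases with $x$, so every ratio criterion is a cut of the form $x \ge c$. The function name is hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def neyman_pearson_level(k: float, d: float):
    """Return (beta_k, hit probability) for false-alarm probability k.

    Assumed model: x ~ Normal(0, 1) under N and x ~ Normal(d, 1) under SN.
    """
    c = STD.inv_cdf(1.0 - k)                # cut on x with P_N[x >= c] = k
    beta_k = math.exp(d * c - 0.5 * d * d)  # likelihood ratio evaluated at the cut
    hit = 1.0 - STD.cdf(c - d)              # P_SN[x >= c]
    return beta_k, hit
```

With `k = 0.05` and `d = 1.0` the cut falls near $x = 1.645$; the returned operating level is the value of the likelihood ratio at that cut, and the hit probability is the height of the ROC curve above the point $(k, 0)$.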

2.7  ROC  curve 

It is desirable to digress for a moment to study the ROC curve more closely.
Its value lies in the fact that if the type of criterion chosen for a particular application
is a ratio criterion, $A(\beta)$, then a complete description of the detection system's performance
can be read off the ROC curve. By the very definition of the ROC curve, the $X$
coordinate is the conditional probability $F$ of false alarm, and the $Y$ coordinate is the
conditional probability of a hit. Similarly $(1 - X)$ is the conditional probability of
being correct when noise alone is present, and $(1 - Y) = M$ is the conditional probability
of a miss. It will be shown in a moment that the operating level $\beta$ for the ratio
criterion $A(\beta)$ can also be determined from the ROC curve as the slope at the point
$[X(\beta), Y(\beta)]$.

Since most proposed kinds of optimum criteria can be reduced to ratio criteria, the
ROC curve assumes considerable importance.

In order to determine some of its geometric properties, it will be assumed that
the parametric functions

$$X = X(\beta) = P_N[A(\beta)] \qquad \text{and} \qquad Y = Y(\beta) = P_{SN}[A(\beta)] \tag{16}$$

are differentiable functions of $\beta$. The slope of the tangent to the ROC curve is given by
the quotient $(dY/d\beta)/(dX/d\beta)$. To calculate the slope at the point $[X(\beta_0), Y(\beta_0)]$, notice
that among all criteria $A$, the quantity $P_{SN}(A) - \beta_0 P_N(A)$ is maximized by $A = A(\beta_0)$.
Therefore, in particular, the function

$$Y(\beta) - \beta_0 X(\beta) = P_{SN}[A(\beta)] - \beta_0 P_N[A(\beta)] \tag{17}$$

has a maximum at $\beta = \beta_0$, so that its derivative must vanish there. Thus differentiating,

$$\frac{dY}{d\beta} - \beta_0 \frac{dX}{d\beta} = 0 \qquad \text{at } \beta = \beta_0. \tag{18}$$

Solving for $\beta_0$, one obtains

$$\beta_0 = \frac{(dY/d\beta)_{\beta = \beta_0}}{(dX/d\beta)_{\beta = \beta_0}} = \text{the slope of the tangent to the ROC curve at the point } [X(\beta_0), Y(\beta_0)]. \tag{19}$$

This shows that the slope of the ROC curve is given by its parameter $\beta$, and so is always
positive. Hence the curve rises steadily. In addition, this means that $Y(\beta)$ can be written
as a single valued function of $X(\beta)$, $Y = Y(X)$, which is monotone increasing, and where
$Y(0) = 0$ and $Y(1) = 1$. These remarks make fully warranted the sketch of the ROC
curve  given  in  Fig.  1.  The  next  two  sections  are  concerned  with  determining  the  best 
value  to  use  for  the  weighting  factor  w  when  a  priori  probabilities  are  known. 
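
The slope property of Eq. (19) can be checked numerically. The sketch below again uses an assumed one-measurement model that is not taken from the paper ($x$ normal with mean 0 under N and mean $d$ under SN, unit variance in both cases), for which the ratio criterion $A(\beta)$ is the set $x \ge c$ with $c = (\ln \beta)/d + d/2$. The function names and the step size `h` are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def roc_point(beta: float, d: float = 1.0):
    """(X, Y) = (P_N[A(beta)], P_SN[A(beta)]) for the assumed Gaussian model."""
    c = math.log(beta) / d + 0.5 * d  # l(x) >= beta is equivalent to x >= c
    return 1.0 - STD.cdf(c), 1.0 - STD.cdf(c - d)


def roc_slope(beta: float, d: float = 1.0, h: float = 1e-4) -> float:
    """Finite-difference estimate of dY/dX at the ROC point with parameter beta."""
    x_lo, y_lo = roc_point(beta * (1.0 + h), d)  # slightly stricter criterion
    x_hi, y_hi = roc_point(beta * (1.0 - h), d)  # slightly looser criterion
    return (y_hi - y_lo) / (x_hi - x_lo)
```

For this model `roc_slope(beta)` should come out close to `beta` itself for any positive operating level, in line with the statement that the parameter of the curve is its slope.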

2.8 Siegert's "Ideal Observer's" criteria

Here it is necessary to know beforehand the a priori probabilities that population
SN and that population N will be sampled. This is an additional assumption.
These probabilities are denoted respectively by $P(SN)$ and $P(N)$. Moreover,
$P(SN) + P(N) = 1$ because at least one of the populations must be sampled. The criterion
associated with Siegert's Ideal Observer is usually defined as a criterion for which the
a priori probability of error is minimized (or, equivalently, the a priori probability of a
correct response is maximized) [5]. Frequently the only case considered is that where
$P(SN)$ and $P(N)$ are equal, but this restriction is not necessary.

Since the conditional probability $F$ of a false alarm is known, as well as the a
priori probability of the event (that population $N$ was sampled) upon which $F$ is
conditional, the probability of a false alarm is given by the product

$$P(N)F. \tag{20}$$

In the same way the probability of a miss is given by

$$P(SN)M. \tag{21}$$

Because an error $E$ can occur in exactly these two ways, the probability of error is
the sum of these quantities

$$P(E) = P(N)F + P(SN)M. \tag{22}$$

It has already been pointed out that $F = P_N(A)$ and $M = 1 - P_{SN}(A)$. If
these are substituted into the expression for $P(E)$ a simple algebraic manipulation gives

$$P(E) = P(SN) - P(SN)\left[P_{SN}(A) - \frac{P(N)}{P(SN)} P_N(A)\right]. \tag{23}$$

It is desired to minimize $P(E)$. But from the last equation this is equivalent to
maximizing the quantity

$$P_{SN}(A) - \frac{P(N)}{P(SN)} P_N(A), \tag{24}$$

and, of course, this will yield a weighted combination criterion with $w = P(N)/P(SN)$,
which is known to be simply a ratio criterion $A(w)$.
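The conclusion that the error probability of Eq. (22) is smallest at the operating level $\beta = P(N)/P(SN)$ can be illustrated with a small search. The sketch assumes, purely for illustration, the Gaussian model in which $\ln l(X)$ has variance $d$ and means $\mp d/2$ under $N$ and $SN$; the grid limits and the names used are hypothetical choices.

```python
import math


def upper_tail(z):
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def error_probability(beta, p_sn, d):
    """P(E) = P(N) F + P(SN) M, Eq. (22), for the ratio criterion A(beta)."""
    a = math.log(beta)
    f = upper_tail((a + d / 2.0) / math.sqrt(d))        # F = P_N(A)
    m = 1.0 - upper_tail((a - d / 2.0) / math.sqrt(d))  # M = 1 - P_SN(A)
    return (1.0 - p_sn) * f + p_sn * m


def best_beta_on_grid(p_sn, d, lo=-5.0, hi=5.0, steps=2001):
    """Search an evenly spaced grid in ln(beta) for the smallest P(E)."""
    grid = [math.exp(lo + (hi - lo) * i / (steps - 1)) for i in range(steps)]
    return min(grid, key=lambda b: error_probability(b, p_sn, d))


if __name__ == "__main__":
    p_sn = 0.2
    print("grid minimum:", best_beta_on_grid(p_sn, d=2.0))
    print("P(N)/P(SN):  ", (1.0 - p_sn) / p_sn)
```

With $P(SN) = 0.2$ the grid minimum should fall next to $w = 0.8/0.2 = 4$, up to the resolution of the grid.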

2.9  Maximum expected-value criteria

Another way to assign a weighting factor $w$ depends on knowing the "expected
value" of each criterion. This can be determined if the a priori probabilities $P(SN)$ and
$P(N)$ are known, and if numerical values can be assigned to the four alternatives. Let
$V_D$ be the value of detection and $V_Q$ the value of being "quiet," that is, of correctly
deciding that noise alone is present. The other two alternatives are also assigned
values, $V_M$, the value of a miss, and $V_F$, the value of a false alarm. The expected value
associated with a criterion can now be determined. In this case it is natural to define an
optimum criterion as one which maximizes the expected value. It can be shown that
such a criterion maximizes

$$P_{SN}(A) - \frac{P(N)}{P(SN)} \cdot \frac{V_Q - V_F}{V_D - V_M}\, P_N(A). \tag{25}$$

By definition (see paragraph 2.5), this criterion is a weighted combination criterion with
weighting factor

$$w = \frac{P(N)}{P(SN)} \cdot \frac{V_Q - V_F}{V_D - V_M}, \tag{26}$$

and hence a likelihood ratio criterion. Siegert's "Ideal Observer" criterion is the special
case for which $V_Q - V_F = V_D - V_M$.
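The step from the expected value to expression (25) can be made explicit. Writing out the expected value of a criterion $A$ and collecting terms gives $P(SN)V_M + P(N)V_Q + P(SN)(V_D - V_M)[P_{SN}(A) - w P_N(A)]$, so that, provided $V_D > V_M$, maximizing the expected value is the same as maximizing (25). The following sketch evaluates both sides; the function names and the numerical values are hypothetical and serve only as an illustration.

```python
def weighting_factor(p_sn, v_d, v_m, v_q, v_f):
    """Weighting factor w of Eq. (26); assumes V_D > V_M."""
    return ((1.0 - p_sn) / p_sn) * ((v_q - v_f) / (v_d - v_m))


def expected_value(p_sn, p_detect, p_false_alarm, v_d, v_m, v_q, v_f):
    """Expected value of a criterion A with P_SN(A) = p_detect, P_N(A) = p_false_alarm."""
    p_n = 1.0 - p_sn
    with_signal = v_d * p_detect + v_m * (1.0 - p_detect)
    with_noise = v_f * p_false_alarm + v_q * (1.0 - p_false_alarm)
    return p_sn * with_signal + p_n * with_noise


if __name__ == "__main__":
    p_sn, v_d, v_m, v_q, v_f = 0.3, 5.0, -2.0, 1.0, -4.0
    w = weighting_factor(p_sn, v_d, v_m, v_q, v_f)
    p_detect, p_false_alarm = 0.9, 0.4
    direct = expected_value(p_sn, p_detect, p_false_alarm, v_d, v_m, v_q, v_f)
    # Same quantity, rearranged so that expression (25) appears in brackets.
    rearranged = (p_sn * v_m + (1.0 - p_sn) * v_q
                  + p_sn * (v_d - v_m) * (p_detect - w * p_false_alarm))
    print(w, direct, rearranged)
```

When $V_Q - V_F = V_D - V_M$ the function `weighting_factor` reduces to $P(N)/P(SN)$, the Ideal Observer value.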

2.10  A posteriori probability and signal detectability

Heretofore the observer has been limited to two possible answers, "signal plus
noise is present" or "noise alone is present." Instead he may be asked what, to the
best of his knowledge, is the probability that a signal is present. This approach has the
advantage of getting more information from the receiving equipment. In fact, Woodward
and Davies point out that if the observer makes the best possible estimate of this
probability for each possible transmitted message, he is supplying all the information
which his equipment can give him [6]. A good discussion of this approach is found in
the original papers by Woodward and Davies [6, 7]. Their formula for the a posteriori
probability, $P_X(SN)$, becomes, in the notation of this paper,

$$P_X(SN) = \frac{f_{SN}(X) P(SN)}{f_{SN}(X) P(SN) + [1 - P(SN)] f_N(X)}, \tag{27}$$

or

$$P_X(SN) = \frac{l(X) P(SN)}{l(X) P(SN) + 1 - P(SN)}. \tag{28}$$

If a receiver which has likelihood ratio as its output can be built, and if the a priori
probability $P(SN)$ is known, a posteriori probability can be calculated easily. The
calculation could be built into the receiver calibration, since (28) is a monotonic function
of $l(X)$; this would make the receiver an optimum receiver for obtaining a posteriori
probability.
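The calibration described here amounts to a single function of the receiver output. A minimal sketch of Eq. (28) follows; the function name is a hypothetical choice, and the likelihood ratio is simply assumed to be available as a number.

```python
def a_posteriori_probability(likelihood_ratio, p_sn):
    """Eq. (28): P_X(SN) from l(X) and the a priori probability P(SN)."""
    weighted = likelihood_ratio * p_sn
    return weighted / (weighted + 1.0 - p_sn)


if __name__ == "__main__":
    for l in (0.1, 1.0, 10.0, 100.0):
        print(l, a_posteriori_probability(l, p_sn=0.3))
```

A likelihood ratio of one leaves the a priori probability unchanged, and larger values of $l(X)$ give larger a posteriori probabilities, which is the monotonic behavior the calibration relies on.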

3.  Sequential  Tests  with  Minimum  Average  Duration 
3.1  Sequential  testing 

The idea of sequential testing is this: make one measurement $x_1$ on the receiver
input; if the evidence $x_1$ is sufficiently persuasive, decide as to whether the receiver
input was drawn from population $SN$ or from population $N$. If the evidence is not so
strong, make a second measurement $x_2$ and consider the evidence $(x_1, x_2)$. Continue
to make measurements until the resulting sequence of measurements is sufficiently
persuasive in favor of one population or the other. Obviously this involves the
theoretical possibility of making arbitrarily many measurements before a final decision
is made. This does not mean that infinitely many measurements must be made in an
actual application, nor does it necessarily mean that the operation might entail an
arbitrarily long interval of time. If, in a particular application, measurements are taken
at evenly spaced times then the "time base" of such a measurement plan is infinite.
However, another plan might call for measurements to be made at the instants $t = 0$,
$t = 1/2$, ..., $t = (n - 1)/n$, and as these times all lie in the time interval from zero to
one, such a measurement plan would have a time base of only one unit of time.

If the measurement plan has been carried out to the stage where $n$ measurements
$x_1, x_2, \ldots, x_n$ have been made, the variable $X_n = (x_1, x_2, \ldots, x_n)$ is called the $n$th
stage sample variable. A specific plan for measurements will be considered only if for
each possible stage $n$, the two density functions $f_{SN}(X_n)$ and $f_N(X_n)$ of the $n$th stage
sample variable $X_n$ are known; the first of these density functions is applicable when
population $SN$ is being sampled and the second is applicable when population $N$ is
being sampled. These density functions may very well differ at different stages, so that
they should be written $f^{n}_{SN}(X_n)$ and $f^{n}_{N}(X_n)$; however, the $n$ appearing in the argument
$X_n$ should always make the situation clear, and the superscript on the density functions
themselves will be omitted.

3.2  Sequential tests

A sequential test will consist of two things:

(1)  An (infinite) measurement plan with density functions $f_N(X_n)$ and $f_{SN}(X_n)$,

(2)  An assignment of three criteria to each stage of the measurement plan.
These three criteria represent the three possible conclusions:

(A)  Signal plus noise is present, i.e. the sample comes from population $SN$,

(B)  Noise alone is present, i.e. the sample comes from population $N$,

(C)  Another measurement should be made.

At the first stage of the measurement plan, any (real) number at all could
theoretically result from the first measurement. This means that the first stage sample
variable $X_1 = (x_1)$ ranges through the entire number system, which will be written $S_1$
to stand for the first stage sample space. Suppose the three first-stage criteria $A_1$, $B_1$,
and $C_1$ have been chosen. If the sample $X_1$ is listed in $A_1$, the conclusion that a signal is
present is drawn and the test is terminated. If it is listed in $B_1$, the conclusion is that
noise alone is present, and again the test is terminated. If $X_1$ should be listed in $C_1$,
another measurement will be made, and the test moves on to the second stage instead
of terminating.

When the first stage criteria have been chosen, a limitation is placed on $S_2$, the
space through which the second stage sample variable $X_2 = (x_1, x_2)$ ranges. The only
way the test can proceed to the second stage is for $X_1 = (x_1)$ to be listed in $C_1$. Therefore,
$S_2$ does not contain all possible second stage samples $X_2 = (x_1, x_2)$ but only those
for which $(x_1)$ is listed in $C_1$. Three second stage criteria, $A_2$, $B_2$, and $C_2$, must now be
chosen from those samples $X_2$ listed in $S_2$. They must be chosen in such a way that
there are no duplications in the listings and no sample in $S_2$ is omitted. These criteria
carry exactly the same significance as those chosen in the first stage. That is, the three
conclusions that a signal is or is not present, or that the test should be continued, are
drawn when the sample $X_2$ is listed in $A_2$, $B_2$, or $C_2$ respectively.

The selection of criteria proceeds in the same way. If the $n$th stage criteria
$A_n$, $B_n$, and $C_n$ have been chosen, then the next stage's sample space $S_{n+1}$ consists of
those samples $X_{n+1} = (x_1, x_2, \ldots, x_n, x_{n+1})$ for which $X_n = (x_1, x_2, \ldots, x_n)$ was listed
in $C_n$. Then from $S_{n+1}$ are drawn the three $(n + 1)$th stage criteria $A_{n+1}$, $B_{n+1}$, and $C_{n+1}$.

When an entire sequence

$$(A_1, B_1, C_1),\ (A_2, B_2, C_2),\ \ldots,\ (A_n, B_n, C_n),\ \ldots$$

of criteria is selected, a "sequential test" has been determined. This does not mean of
course that the test will necessarily be particularly useful. However, among all the
possible ways of selecting a sequence of criteria and hence a sequential test, there
may be particular ones which are very useful.

3.3  Probabilities associated with sequential tests

If $Q_n$ is any $n$th stage criterion, then the quantities*

$$P_N(Q_n) = \int_{Q_n} f_N(X_n)\, dX_n \quad \text{and} \quad P_{SN}(Q_n) = \int_{Q_n} f_{SN}(X_n)\, dX_n \tag{29}$$

represent the ($N$ or $SN$) conditional probabilities that an $n$th stage sample $X_n$ will be
listed in the criterion $Q_n$. Conditional probabilities of particular interest are:

(1)  The $n$th stage conditional error probabilities:

If population $N$ is sampled, then the probability that the sample variable $X_n$
will be listed in $A_n$ is $P_N(A_n)$. This is the $N$-conditional probability of a false alarm.

If population $SN$ is sampled, then the probability that the sample variable $X_n$
will be listed in $B_n$ is $P_{SN}(B_n)$. This is the $SN$-conditional probability of a miss.

(2)  The conditional error probabilities of the entire test:

$$F = \sum_{n=1}^{\infty} P_N(A_n), \text{ the } N\text{-conditional probability of a false alarm, and} \tag{30}$$

$$M = \sum_{n=1}^{\infty} P_{SN}(B_n), \text{ the } SN\text{-conditional probability of a miss,} \tag{31}$$

are merely the sums of the same error probabilities over all stages.

(3)  The conditional probabilities of terminating at stage $n$ are

$$T_N^n = P_N(A_n) + P_N(B_n), \tag{32}$$

and

$$T_{SN}^n = P_{SN}(A_n) + P_{SN}(B_n). \tag{33}$$

These equations can be justified by a simple argument. The only way the test can
terminate at stage $n$ is for the sample variable $X_n$ to be listed in either $A_n$ or $B_n$. The
probability of this event is the sum of the probabilities of the component events which
are mutually exclusive since $X_n$ can be listed in at most one of $A_n$ and $B_n$.

(4)  The conditional probabilities that the entire test will terminate are

$$T_N = \sum_{n=1}^{\infty} T_N^n, \tag{34}$$

and

$$T_{SN} = \sum_{n=1}^{\infty} T_{SN}^n. \tag{35}$$


*  The notation $\int_{Q_n}$ indicates that the integration is to be carried out over all sample
points listed in $Q_n$.

3.4  Average sample numbers

There are two other quantities which must be introduced. One feature of the
sequential test is that it affords an opportunity of arriving at a decision early in the
sampling process when the data happen to be unusually convincing. Thus one might
expect that, on the average, the stage of termination of a well-constructed sequential
test would be lower than could be achieved by an otherwise equally good standard
test. It is therefore important to obtain expressions for the average or expected value
of the stage of termination. As with other probabilities, there will be two of these
quantities: one conditional on population $N$ being sampled; the other conditional on
population $SN$ being sampled. They are given by

$$E_N = \sum_{n=1}^{\infty} n\, T_N^n \tag{36}$$

and

$$E_{SN} = \sum_{n=1}^{\infty} n\, T_{SN}^n. \tag{37}$$

The letter $E$ is used to refer to the term "expected value." The quantities $E_N$ and $E_{SN}$
are called the average sample numbers. The form these formulas take can be justified
(somewhat freely) on the grounds that each value, $n$, which the variable "stage of
termination" may take on must be weighted by the (conditional) probability that the
variable will in fact take on that value.

It should be heavily emphasized that the average sample numbers are strictly
average figures. In actual runs of a sequential test, the stages of termination will sometimes
be less than the average sample numbers but will also be, upon occasion, much
larger. Any sequential test whose average sample numbers are not finite would be
useless for applications. Therefore the only ones to be considered are those with finite
average sample numbers. Under this assumption,* it can be shown that $T_N = T_{SN} = 1$
so that the test is certain to terminate (in the sense of probability). On the other hand,
if it is known that $T_N = T_{SN} = 1$ it does not always follow that the average sample
numbers are finite. Such a situation would mean only that if a sequence of runs of the
test were made, each run would probably terminate, but the average stage of termination
would become arbitrarily large as more runs were made.

3.5  Sequential  ratio  tests 

In studying non-sequential tests using finite samples it was found that the best
criterion could always be expressed in terms of likelihood ratio. Therefore, it may be
useful to introduce likelihood ratios at each stage of an infinite sample plan. The $n$th
stage likelihood ratio function $l(X_n)$ is defined as the ratio $f_{SN}(X_n)/f_N(X_n)$. Optimum
criteria in the finite-sample tests turned out to be criteria listing all samples $X$ for
which $l(X)$ is greater than or equal to a certain number. It should be possible to choose
sequential criteria $(A_n, B_n, C_n)$ in the same way. For each stage two numbers $a_n$ and
$b_n$ with $b_n < a_n$ could be chosen. Then the criteria $(A_n, B_n, C_n)$ determined by the
numbers $a_n$ and $b_n$ would be

$A_n$ lists all samples $X_n$ of the sample space $S_n$ for which $l(X_n) \geq a_n$,
$B_n$ lists all samples $X_n$ of the sample space $S_n$ for which $l(X_n) \leq b_n$,
$C_n$ lists all samples $X_n$ of the sample space $S_n$ for which $b_n < l(X_n) < a_n$.

If criteria selected in this way meet the requirement that the average sample numbers
be finite, then the resulting sequential test is called a "sequential ratio test."

*  Remember that the sampling process is not assumed to yield independence among
the $x_i$.

3.6  Optimum  sequential  tests 

It is customary [8] to define an optimum sequential test as that one for which
the average sample numbers $E_N$ and $E_{SN}$ are minimum among all sequential tests with
fixed error probabilities $F$ and $M$.

In addition to the formulas given in Section 3.4, alternative formulas [9] for
the average sample numbers are

$$E_N = 1 + \sum_{i=1}^{\infty} P_N(C_i) \tag{38}$$

and

$$E_{SN} = 1 + \sum_{i=1}^{\infty} P_{SN}(C_i). \tag{39}$$
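The agreement between these formulas and those of Section 3.4 rests on one observation: the criteria $A_n$, $B_n$, $C_n$ exhaust the sample space $S_n$, and $S_n$ contains exactly the samples continued from $C_{n-1}$, so that the probability of terminating at stage $n$ is $P(C_{n-1}) - P(C_n)$, with $P(C_0)$ taken as one. The sketch below compares the two sums for a hypothetical test in which the probability of continuing past stage $i$ is $c^i$; this geometric form is assumed only to have numbers to work with.

```python
def average_sample_number_direct(p_continue):
    """Eqs. (36)/(37): sum of n * T^n, with T^n = P(C_{n-1}) - P(C_n), P(C_0) = 1."""
    total = 0.0
    previous = 1.0
    for n, p_c in enumerate(p_continue, start=1):
        total += n * (previous - p_c)
        previous = p_c
    return total


def average_sample_number_alternative(p_continue):
    """Eqs. (38)/(39): one plus the sum of the continuation probabilities P(C_i)."""
    return 1.0 + sum(p_continue)


if __name__ == "__main__":
    c = 0.6
    # Truncated at 200 stages; the neglected tail is negligible for this c.
    p_continue = [c ** i for i in range(1, 201)]
    print(average_sample_number_direct(p_continue))
    print(average_sample_number_alternative(p_continue))
```

Both sums approach $1/(1 - c)$ for this example, which is 2.5 when $c = 0.6$.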

Thus, if a set of sequential criteria $\{(A_n^*, B_n^*, C_n^*)\}$ is presented as a possible optimum
test, then its optimum character is decided by ascertaining whether the inequalities

$$\sum_i P_N(C_i^*) \leq \sum_i P_N(C_i) \tag{40}$$

and

$$\sum_i P_{SN}(C_i^*) \leq \sum_i P_{SN}(C_i) \tag{41}$$

hold for every other set of sequential criteria $\{(A_n, B_n, C_n)\}$ with the same error probabilities,
i.e., with

$$\sum_i P_N(A_i^*) = \sum_i P_N(A_i) \tag{42}$$

and

$$\sum_i P_{SN}(B_i^*) = \sum_i P_{SN}(B_i). \tag{43}$$

The problem of constructing an optimum sequential test is difficult because the
equalities (42) and (43) can be satisfied even when there is no apparent term-by-term
relation between the sequences $\{P_N(C_i^*)\}$ and $\{P_N(C_i)\}$. Wald has proposed as optimum
the tests in which each of the sequences $\{a_n\}$ and $\{b_n\}$ is constant, that is, $b_1 = b_n$
and $a_1 = a_n$ for all $n$. Moreover Wald and Wolfowitz [10] proved that these tests are
optimum whenever the density functions at successive stages are independent, as can
be the case for example when both noise and signal plus noise consist of "random
noise." However, this "randomness" is not met with in most applications of the theory
of signal detectability, at least not in the sense that the hypotheses of Wald and Wolfowitz
are satisfied.
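A sequential ratio test with constant thresholds is easily simulated in the special case where the independence hypothesis does hold. The sketch below is such a special case and nothing more: it assumes independent, unit-variance Gaussian measurements with mean zero under $N$ and mean `mu` under $SN$, so that $\ln l(X_n)$ is a running sum. The threshold values, the parameter `mu`, and all function names are hypothetical choices made for the illustration.

```python
import math
import random


def measurements(rng, mean):
    """Endless stream of independent unit-variance Gaussian measurements."""
    while True:
        yield rng.gauss(mean, 1.0)


def sequential_ratio_test(samples, a, b, mu):
    """Apply constant thresholds b < a to l(X_n), stage by stage.

    Under the assumed model, ln l(X_n) = sum_i (mu * x_i - mu**2 / 2).
    Returns (decision, stage), where decision is "SN" or "N".
    """
    log_a, log_b = math.log(a), math.log(b)
    log_l = 0.0
    for stage, x in enumerate(samples, start=1):
        log_l += mu * x - 0.5 * mu * mu
        if log_l >= log_a:   # X_n is listed in A_n
            return "SN", stage
        if log_l <= log_b:   # X_n is listed in B_n
            return "N", stage
        # otherwise X_n is listed in C_n and another measurement is made
    raise ValueError("measurements ran out before a decision was reached")


def simulate(runs, a, b, mu, signal_present, seed=0):
    """Estimate the error probability and the average sample number."""
    rng = random.Random(seed)
    mean = mu if signal_present else 0.0
    wrong_answer = "N" if signal_present else "SN"
    errors = 0
    stages = 0
    for _ in range(runs):
        decision, stage = sequential_ratio_test(measurements(rng, mean), a, b, mu)
        errors += decision == wrong_answer
        stages += stage
    return errors / runs, stages / runs


if __name__ == "__main__":
    print("noise alone:      ", simulate(2000, 19.0, 1.0 / 19.0, 1.0, False))
    print("signal plus noise:", simulate(2000, 19.0, 1.0 / 19.0, 1.0, True))
```

Each call returns an estimate of the conditional error probability ($F$ or $M$) together with an estimate of the corresponding average sample number; recording the individual stages of termination instead of their mean would give the dispersion discussed in the next paragraph.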

Consider a test of fixed length as described in Section 2, with error probabilities
$F$ and $M$. Although the optimum sequential test with these same error probabilities
generally requires less time on the average, it has the disadvantage that it will sometimes
use much more time than the fixed length test requires. In a conversation with the
authors, Professor Mark Kac of Cornell University suggested that the dispersion, or
variance, of the sample numbers may be so large as seriously to affect the usefulness
of the sequential tests in applications to signal detectability. Certainly this matter
should be investigated before a final decision is reached concerning the merits of
sequential tests relative to tests on a fixed observation interval. However it is a difficult
matter to calculate the variance of the sample numbers. Therefore an electronic
simulator is being built at the University of Michigan which will simulate both types
of tests and will provide data for ROC curves of both types as well as the distribution
of the (sequential) sample numbers.


4.  Optimum Detection for Specific Cases
4.1  Introduction


The  chief  conclusion  obtained  from  the  general  theory  of  signal  detectability 
presented  in  Section  2  of  this  paper  is  that  a  receiver  which  calculates  the  likelihood 
ratio  for  each  receiver  input  is  the  optimum  receiver  for  detecting  signals  in  noise. 


TABLE I

Section   Description of Signal Ensemble        Application

4.4       Signal known exactly*                 Coherent radar with a target of known range
                                                and character

4.5       Signal known except for phase*        Ordinary pulse radar with no integration and
                                                with a target of known range and character

4.6       Signal a sample of white Gaussian     Detection of noise-like signals; detection of
          noise                                 speech sounds in Gaussian noise

4.7       Detector output of a broad band       Detecting a pulse of known starting time (such
          receiver                              as a pulse from a radar beacon) with a
                                                crystal-video or other type broad band receiver

4.8       A radar case (A train of pulses       Ordinary pulse radar with integration and with
          with incoherent phase)                a target of known range and character

4.10      Signal one of M orthogonal signals    Coherent radar where the target is at one of a
                                                finite number of non-overlapping positions

4.11      Signal one of M orthogonal signals    Ordinary pulse radar with no integration and
          known except for phase                with a target which may appear at one of a
                                                finite number of non-overlapping positions

*  Our treatment of these two fundamental cases is based upon Woodward and Davies'
work, but here they are treated in terms of likelihood ratio, and hence apply to criterion type
receivers as well as to a posteriori probability type receivers. These first two cases have been
solved for the more general problem in which the noise is Gaussian but has an arbitrary
spectrum [11, 12]. Those solutions require the use of an infinite sampling plan and are
considerably more involved than the corresponding derivations in this report.

It is the purpose of Section 4 to consider a number of different ensembles of
signals with bandlimited white Gaussian noise. For each case, a possible receiver
design is discussed. The primary emphasis, however, is on obtaining the probability of
detection and probability of false alarm, and hence on estimates of optimum receiver
performance for the various cases.

The  cases  which  are  presented  were  chosen  from  the  simplest  problems  in  signal 
detection  which  closely  represent  practical  situations.  They  are  listed  in  Table  I  along 
with  examples  of  engineering  problems  in  which  they  find  application.  In  the  last  two 
cases  the  uncertainty  in  the  signal  can  be  varied,  and  some  light  is  thrown  on  the 
relationship  between  uncertainty  and  the  ability  to  detect  signals.  The  variety  of 
examples  presented  should  serve  to  suggest  methods  for  attacking  other  simple  signal 
detection  problems  and  to  give  insight  into  problems  too  complicated  to  allow  a  direct 
solution. 

The reader will find the discussion of likelihood ratio and its distribution easier
to follow if he keeps in mind the connection between a criterion type receiver and
likelihood ratio. In an optimum criterion type system, the operator will say that a signal
is present whenever the likelihood ratio is above a certain level $\beta$. He will say that
only noise is present when the likelihood ratio is below $\beta$. For each operating level
$\beta$, there is a false alarm probability and a probability of detection. The false alarm
probability is the probability that the likelihood ratio $l(X)$ will be greater than $\beta$ if no
signal is sent; this is by definition the complementary distribution function $F_N(\beta)$.
Likewise, the complementary distribution $F_{SN}(\beta)$ is the probability that $l(X)$ will be
greater than $\beta$ if there is signal plus noise, and hence $F_{SN}(\beta)$ is the probability of
detection if a signal is sent.

4.2  Gaussian  noise 

In the remainder of this paper the receiver inputs will be assumed to be defined
on a finite observation interval, $0 \leq t \leq T$. It will further be assumed that the receiver
inputs are series-bandlimited. By the sampling plan C (Section 1.2) any such receiver
input $x(t)$ can be reconstructed from sample values of the function taken at points $1/2W$
apart throughout the observation interval, i.e.,

$$x(t) = \sum_{k=1}^{2WT} x_k \psi_k(t), \tag{44}$$

where

$$\psi_k(t) = \frac{\sin\left[\pi (2WT)\left(\dfrac{t}{T} - \dfrac{k}{2WT}\right)\right]}{2WT \sin\left[\pi\left(\dfrac{t}{T} - \dfrac{k}{2WT}\right)\right]}. \tag{45}$$

Therefore the receiver inputs can be represented by the sample $(x_1, x_2, \ldots, x_{2WT})$.
In Section 4 the notation $x$ will be used to denote either the receiver input function $x(t)$
or the sample $(x_1, x_2, \ldots, x_{2WT})$. Similarly the signal $s(t)$, or simply $s$, can be represented
by the sample $(s_1, \ldots, s_{2WT})$, where $s_k = s(k/2W)$.
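The reconstruction of Eq. (44) can be tried on a simple series-bandlimited function. The sketch below is illustrative: it assumes that $n = 2WT$ is an odd integer (so that the limiting value of $\psi_k$ at its own sample point is one), takes $T = 1$ and $n = 9$, and uses a hypothetical test function built from a constant and two harmonics of $1/T$. None of these particular choices comes from the text.

```python
import math


def psi(k, t, n, T):
    """psi_k(t) of Eq. (45), with n = 2WT sample points on 0 <= t <= T (n odd)."""
    u = t / T - k / n
    s = math.sin(math.pi * u)
    if abs(s) < 1e-12:
        # Limiting value where numerator and denominator both vanish (n odd).
        return 1.0
    return math.sin(math.pi * n * u) / (n * s)


def reconstruct(samples, t, T):
    """Eq. (44): x(t) from the samples x_k = x(k/2W) = x(k*T/n), k = 1..n."""
    n = len(samples)
    return sum(x_k * psi(k, t, n, T) for k, x_k in enumerate(samples, start=1))


def example_input(t, T):
    """Hypothetical input containing only harmonics 0, 2 and 3 of 1/T."""
    return (1.0 + 2.0 * math.cos(2.0 * math.pi * 2.0 * t / T)
            - math.sin(2.0 * math.pi * 3.0 * t / T))


if __name__ == "__main__":
    n, T = 9, 1.0
    samples = [example_input(k * T / n, T) for k in range(1, n + 1)]
    for t in (0.05, 0.37, 0.81):
        print(t, example_input(t, T), reconstruct(samples, t, T))
```

For such an input the reconstructed values should coincide with the original function to within rounding error at every $t$ in the interval, not only at the sample points.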

Only the probability distributions for receiver inputs $x(t)$ can be specified. The
distribution must be given for the receiver inputs both with noise alone and with signal
plus noise. The probability distributions are described by giving the probability
density functions $f_{SN}(x)$ and $f_N(x)$ for the receiver inputs $x$.

The probability density function for the receiver inputs with noise alone is
assumed to be

$$f_N(x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi N}} \exp\left[-\frac{x_i^2}{2N}\right],$$

or

$$f_N(x) = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\left[-\frac{1}{2N}\sum_{i=1}^{n} x_i^2\right], \tag{46}$$

where $n$ is $2WT$ and $N$ is the noise power. It can be verified easily that this probability
density function is the description of noise which has a Gaussian distribution of
amplitude at every time, is stationary, and has the same average power in each of its
Fourier components. Thus we shall refer to it as "stationary bandlimited white
Gaussian noise."

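The functions $\psi_k(t)$ are orthogonal and have energy $1/2W$, and therefore

$$\sum_{i=1}^{n} x_i^2 = 2W \int_0^T [x(t)]^2\, dt, \tag{47}$$

so that

$$f_N(x) = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\left[-\frac{1}{N_0}\int_0^T [x(t)]^2\, dt\right], \tag{48}$$

where $N_0 = N/W$ is the noise power per unit bandwidth.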

In a practical application, information is given about the signals as they would
appear without noise at the receiver input, rather than about the signal plus noise
probability density. Then $f_{SN}(x)$ must be calculated from this information and the
probability density function $f_N(x)$ for the noise. The noise and the signals will be
assumed independent of each other.

If the input to the receiver is the sum of the signal and the noise, then the
receiver input $x(t)$ could have been caused by any signal $s(t)$ and noise $n(t) = x(t) - s(t)$.
The probability density for the input $x$ in signal plus noise is thus the probability
(density) that $s(t)$ and $x(t) - s(t)$ will occur together, averaged over all possible $s(t)$.
If the probability of the signals is described by a density function $f_S(s)$, then

$$f_{SN}(x) = \int f_N(x - s) f_S(s)\, ds, \tag{49}$$

where the integration is over the entire range of the sample variable $s$. A more general
form is used when the probability of the signals is described by a probability measure
$P_S$; the formula in this case is

$$f_{SN}(x) = \int f_N(x - s)\, dP_S(s). \tag{50}$$

This integral is a Lebesgue integral, and is essentially an "average" of $f_N(x - s)$ over
all values of $s$ weighted by the probability $P_S$. If $f_N(x)$ is taken from Eq. (46), this
becomes


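$$f_{SN}(x) = \int f_N(x - s)\, dP_S(s) = \left(\frac{1}{2\pi N}\right)^{n/2} \int \exp\left[-\frac{1}{2N}\sum_{i=1}^{n} (x_i - s_i)^2\right] dP_S(s)$$

$$= \left(\frac{1}{2\pi N}\right)^{n/2} \exp\left[-\frac{1}{2N}\sum_{i=1}^{n} x_i^2\right] \int \exp\left[-\frac{1}{2N}\sum_{i=1}^{n} s_i^2\right] \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_i\right] dP_S(s), \tag{51}$$

or

$$f_{SN}(x) = \int f_N(x - s)\, dP_S(s) = \left(\frac{1}{2\pi N}\right)^{n/2} \int \exp\left[-\frac{1}{N_0}\int_0^T [x(t) - s(t)]^2\, dt\right] dP_S(s)$$

$$= \left(\frac{1}{2\pi N}\right)^{n/2} \exp\left[-\frac{1}{N_0}\int_0^T [x(t)]^2\, dt\right] \int \exp\left[-\frac{1}{N_0}\int_0^T [s(t)]^2\, dt\right] \exp\left[\frac{2}{N_0}\int_0^T x(t)s(t)\, dt\right] dP_S(s). \tag{52}$$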


The factor $\exp\left[-(1/N_0)\int_0^T [x(t)]^2\, dt\right] = \exp\left[-(1/2N)\sum x_i^2\right]$ can be brought out of the
integral since it does not depend on $s$, the variable of integration. Note that the integral

$$\int_0^T [s(t)]^2\, dt = \frac{1}{2W}\sum_{i=1}^{n} s_i^2 = E(s) \tag{53}$$

is the energy* of the expected signal, while

$$\int_0^T x(t)s(t)\, dt = \frac{1}{2W}\sum_{i=1}^{n} x_i s_i \tag{54}$$

is the cross correlation between the expected signal and the receiver input.

4.3  Likelihood  ratio  with  Gaussian  noise 

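Likelihood ratio is defined as the ratio of the probability density functions
$f_{SN}(x)$ and $f_N(x)$. With white Gaussian noise it is obtained by dividing Eq. (51) and (52)
by (46) and (48) respectively:

$$l(x) = \int \exp\left[-\frac{E(s)}{N_0}\right] \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_i\right] dP_S(s), \tag{55}$$

$$l(x) = \int \exp\left[-\frac{E(s)}{N_0}\right] \exp\left[\frac{2}{N_0}\int_0^T x(t)s(t)\, dt\right] dP_S(s). \tag{56}$$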

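If the signal is known exactly or completely specified, the probability for that
signal is unity, and the probability for any set of possible signals not containing $s$ is
zero. Then the likelihood ratio becomes

$$l_s(x) = \exp\left[-\frac{E(s)}{N_0}\right] \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_i\right], \tag{57}$$

or

$$l_s(x) = \exp\left[-\frac{E(s)}{N_0}\right] \exp\left[\frac{2}{N_0}\int_0^T x(t)s(t)\, dt\right]. \tag{58}$$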


*  This assumes that the circuit impedance is normalized to one ohm.

Thus the general formulas (55) and (56) for likelihood ratio state that $l(x)$ is the weighted
average of $l_s(x)$ over the set of all signals, i.e.,

$$l(x) = \int l_s(x)\, dP_S(s). \tag{59}$$
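For samples of finite length these formulas can be evaluated directly. The following sketch computes $l_s(x)$ from Eq. (57), using $E(s)/N_0 = \sum s_i^2 / 2N$, and forms the weighted average of Eq. (59) for a finite ensemble of signals. The restriction to a finite ensemble with given probabilities is an assumption made here so that the integral becomes a sum, and the function names are hypothetical.

```python
import math


def noise_density(x, noise_power):
    """f_N(x) of Eq. (46) for a sample x = (x_1, ..., x_n)."""
    n = len(x)
    quad = sum(xi * xi for xi in x)
    return (2.0 * math.pi * noise_power) ** (-n / 2.0) * math.exp(-quad / (2.0 * noise_power))


def likelihood_ratio_known_signal(x, s, noise_power):
    """l_s(x) of Eq. (57): exp(-E(s)/N_0) * exp((1/N) * sum x_i s_i)."""
    energy_term = sum(si * si for si in s) / (2.0 * noise_power)   # E(s)/N_0
    correlation = sum(xi * si for xi, si in zip(x, s)) / noise_power
    return math.exp(-energy_term) * math.exp(correlation)


def likelihood_ratio_ensemble(x, signals, probabilities, noise_power):
    """Eq. (59) for a finite ensemble: the probability-weighted average of l_s(x)."""
    return sum(p * likelihood_ratio_known_signal(x, s, noise_power)
               for s, p in zip(signals, probabilities))


if __name__ == "__main__":
    x = [0.3, -1.2, 0.8]
    s = [1.0, 0.5, -0.5]
    N = 2.0
    shifted = [xi - si for xi, si in zip(x, s)]
    # The ratio f_N(x - s) / f_N(x) should agree with Eq. (57).
    print(noise_density(shifted, N) / noise_density(x, N))
    print(likelihood_ratio_known_signal(x, s, N))
```

The two printed numbers are the same quantity computed in two ways: as the ratio of the densities and from the closed form of Eq. (57).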


An equipment which calculates the likelihood ratio $l(x)$ for each receiver input
$x$ is the optimum receiver. The form of equation (58) suggests one form which this
equipment might take. First, for each possible expected signal $s$, the individual likelihood
ratio $l_s(x)$ is calculated. Then these numbers are averaged. Since the set of
expected signals is often infinite, this direct method is usually impractical. It is frequently
possible in particular cases to obtain by mathematical operations on Eq. (58) a
different form for $l(x)$ which can be recognized as the response of a realizable electronic
equipment, simpler than the equipment specified by the direct method. It is essentially
this which is done in the following paragraphs.

If the distribution function $P_S(s)$ depends on various parameters such as carrier
phase, signal energy, or carrier frequency, and if the distributions in these parameters
are independent, the expression for likelihood ratio can be simplified somewhat. If
these parameters are indicated by $r_1, r_2, \ldots, r_n$, and the associated probability density
functions are denoted by $f_1(r_1), f_2(r_2), \ldots, f_n(r_n)$, then

$$dP_S(s) = f_1(r_1) f_2(r_2) \cdots f_n(r_n)\, dr_1\, dr_2 \cdots dr_n.$$

The likelihood ratio becomes

$$l(x) = \int \cdots \int l_s(x) f_1(r_1) \cdots f_n(r_n)\, dr_1 \cdots dr_n = \int f_n(r_n) \left[ \cdots \left[ \int f_1(r_1) l_s(x)\, dr_1 \right] \cdots \right] dr_n. \tag{60}$$

Thus the likelihood ratio can be found by averaging $l_s(x)$ with respect to the parameters.

4.4  The case of a signal known exactly

The likelihood ratio for the case when the signal is known exactly has already been presented in Section 4.3:

$$l(x) = \exp\left[-\frac{E}{N_0}\right] \exp\left[\frac{1}{N} \sum_{i=1}^{n} x_i s_i\right], \tag{61}$$

$$l(x) = \exp\left[-\frac{E}{N_0}\right] \exp\left[\frac{2}{N_0} \int_0^T x(t)\, s(t)\, dt\right]. \tag{62}$$

As the first step in finding the distribution functions for $l(x)$, it is convenient to find the distribution for $(1/N) \sum x_i s_i$ when there is noise alone. Then the input $x = (x_1, x_2, \ldots, x_n)$ is due to white Gaussian noise. It can be seen from Eq. (46) that each $x_i$ has a normal distribution with zero mean and variance $N = W N_0$ and that the $x_i$ are independent. Because the $s_i$ are constants depending on the signal to be detected, $s = (s_1, s_2, \ldots, s_n)$, each summand $(x_i s_i)/N$ has a normal distribution with mean $s_i/N$ times the mean of $x_i$, and with variance $(s_i/N)^2$ times the variance of $x_i$, which are zero and $s_i^2/N$ respectively. Because the $x_i$ are independent, the summands $(s_i x_i)/N$ are independent, each with normal distribution, and therefore their sum has a normal distribution with mean the sum of the means (i.e., zero) and variance the sum of the variances.

$$\sum_{i=1}^{n} \frac{s_i^2}{N} = \frac{2 W E(s)}{N} = \frac{2E}{N_0} = \frac{2 \times \text{Signal Energy}}{\text{Noise Power Per Unit Bandwidth}}. \tag{63}$$

The distribution for $(1/N) \sum x_i s_i$ with noise alone is thus normal with zero mean and variance $2E/N_0$. Recalling from Eq. (61)


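$$l(x) = \exp\left[-\frac{E}{N_0} + \frac{1}{N} \sum x_i s_i\right], \tag{64}$$

one sees that the distribution for $(1/N) \sum x_i s_i$ can be used directly by introducing $\alpha$ defined by

$$\beta = \exp\left[-\frac{E}{N_0} + \alpha\right], \quad \text{or} \quad \alpha = \frac{E}{N_0} + \ln \beta. \tag{65}$$

The inequality $l(x) \ge \beta$ is equivalent to $(1/N) \sum x_i s_i \ge \alpha$, and therefore

$$F_N(\beta) = \sqrt{\frac{N_0}{4\pi E}} \int_{\alpha}^{\infty} \exp\left[-\frac{N_0 y^2}{4E}\right] dy. \tag{66}$$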


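The distribution for the case of signal plus noise can be found by using Eq. (19), which states that

$$dP_{SN}(A) = \beta\, dP_N(A). \tag{67}$$

Because these probabilities are equal to the complementary distribution functions for likelihood ratio, this can be written as

$$dF_{SN}(\beta) = \beta\, dF_N(\beta). \tag{68}$$

Differentiating Eq. (66),

$$dF_N(\beta) = -\sqrt{\frac{N_0}{4\pi E}} \exp\left[-\frac{N_0 \alpha^2}{4E}\right] d\alpha, \tag{69}$$

and combining (65), (68), and (69), one obtains

$$dF_{SN}(\beta) = -\sqrt{\frac{N_0}{4\pi E}} \exp\left[-\frac{E}{N_0} + \alpha - \frac{N_0 \alpha^2}{4E}\right] d\alpha. \tag{70}$$

Thus,

$$F_{SN}(\beta) = \sqrt{\frac{N_0}{4\pi E}} \int_{\alpha}^{\infty} \exp\left[-\frac{N_0}{4E} \left(y - \frac{2E}{N_0}\right)^2\right] dy. \tag{71}$$

In summary, $\alpha$, and therefore $\ln \beta$, have normal distributions with signal plus noise as well as with noise alone; the variance of each distribution is $2E/N_0$, and the difference of the means is $2E/N_0$.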
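The two normal distributions summarized above can be turned into points on a receiver operating characteristic. The following is a minimal sketch, assuming the threshold is expressed through $\alpha$ and that the statistic has variance $2E/N_0$ under both hypotheses with means $0$ and $2E/N_0$; the function names `q_function` and `roc_point_known_exactly` are illustrative and not part of the original text.

```python
import math


def q_function(z: float) -> float:
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def roc_point_known_exactly(alpha: float, two_e_over_n0: float) -> tuple:
    """Return (F_N, F_SN) for a threshold alpha.

    Assumes the statistic (1/N) * sum(x_i * s_i) is normal with variance
    2E/N0 in both cases, mean 0 with noise alone and mean 2E/N0 with
    signal plus noise.
    """
    sigma = math.sqrt(two_e_over_n0)
    f_n = q_function(alpha / sigma)
    f_sn = q_function((alpha - two_e_over_n0) / sigma)
    return f_n, f_sn


if __name__ == "__main__":
    d = 4.0  # hypothetical value of 2E/N0
    for alpha in (0.0, 2.0, 4.0):
        f_n, f_sn = roc_point_known_exactly(alpha, d)
        print(f"alpha={alpha:4.1f}  F_N={f_n:.4f}  F_SN={f_sn:.4f}")
```

Sweeping `alpha` over a range of values traces one curve of the family; each value of $2E/N_0$ gives a different curve.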

The receiver operating characteristic curves in Figs. 2 and 3* are plotted for any case in which $\ln l$ has a normal distribution with the same variance both with noise alone and with signal plus noise. The parameter $d$ in these figures is equal to the square of the difference of the means, divided by the variance. These receiver operating characteristic curves apply to the case of the signal known exactly, with $d = 2E/N_0$.

* In Fig. 3, the receiver operating characteristic curves are plotted on "double-probability" paper. On this paper both axes are linear in the error function

$$\operatorname{erf}(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left[-\frac{t^2}{2}\right] dt;$$

this makes the receiver operating characteristic curves straight lines.

Figure 2
Receiver operating characteristic. $\ln l$ is a normal deviate.

Eq. (62) describes what the ideal receiver should do for this case. The essential operation in the receiver is obtaining the correlation, $\int_0^T s(t)\, x(t)\, dt$. The other operations, multiplying by a constant, adding a constant, and taking the exponential function, can be taken care of simply in the calibration of the receiver output. Electronic means of obtaining cross correlation have been developed recently [13].

If the form of the signal is simple, there is a simple way to obtain this cross correlation [6, 7]. Suppose $h(t)$ is the impulse response of a filter. The response $e_o(t)$ of the filter to a voltage $x(t)$ is

$$e_o(t) = \int_{-\infty}^{t} x(\tau)\, h(t - \tau)\, d\tau. \tag{72}$$

If a filter can be synthesized so that

$$h(t) = s(T - t), \quad 0 \le t \le T; \qquad h(t) = 0, \quad \text{otherwise}, \tag{73}$$

then

$$e_o(T) = \int_0^T x(\tau)\, s(\tau)\, d\tau, \tag{74}$$

so that the response of this filter at time $T$ is the cross correlation required. Thus, the ideal receiver consists simply of a filter and amplifiers.
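The equivalence between the filter output at time $T$ and the cross correlation can be checked on sampled data. The sketch below is a discrete-time analogue under assumed values: the test signal, noise level, and sample count are hypothetical, and `matched_filter_output` is an illustrative name.

```python
import numpy as np


def matched_filter_output(x: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Convolve the input with h[k] = s[n - 1 - k], the time-reversed signal."""
    h = s[::-1]
    return np.convolve(x, h)  # full convolution, length 2 * len(s) - 1


rng = np.random.default_rng(0)
n = 64
s = np.sin(2.0 * np.pi * 5.0 * np.arange(n) / n)  # hypothetical known signal
x = s + rng.normal(scale=1.0, size=n)             # signal plus white Gaussian noise

e_o = matched_filter_output(x, s)
correlation = float(np.dot(x, s))

# Sample index n - 1 plays the role of time T in the continuous-time argument.
print(e_o[n - 1], correlation)
```

Only the single output sample at index `n - 1` is needed for detection; the rest of the convolution is computed here merely because `np.convolve` returns the full output.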


Figure 3
Receiver operating characteristic, plotted on double-probability paper for $d = 0, 1, 4, 9, 16, 25,$ and $36$; the axes are percent scales running from 0.1 to 99.9, with $100\,F_N(l)$ on the horizontal axis. $\ln l$ is a normal deviate, $\sigma_{SN}^2 = \sigma_N^2$, $(M_{SN} - M_N)^2 = d\,\sigma_N^2$.
It should be noted that this filter is the same, except for a constant factor, as that specified when one asks for the filter which maximizes peak signal to average noise-power ratio [14].

4.5  Signal  known  except  for  carrier  phase 

The  signal  ensemble  considered  in  this  section  consists  of  all  signals  which  differ 
from  a  given  amplitude  and  frequency  modulated  signal  only  in  their  carrier  phase,  and 
all  carrier  phases  are  assumed  equally  likely. 

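$$s(t) = f(t) \cos[\omega t + \phi(t) - \theta]. \tag{75}$$

Since the unknown phase angle $\theta$ has a uniform distribution,

$$dP_S(s) = \frac{1}{2\pi}\, d\theta. \tag{76}$$

The likelihood ratio can be found by applying Eq. (56), and since the signal energy $E(s)$ is the same for all values of the carrier phase $\theta$,

$$l(x) = \exp\left[-\frac{E}{N_0}\right] \int \exp\left[\frac{1}{N} \sum x_i s_i\right] dP_S(s). \tag{77}$$

Expanding $s$ into the coefficients of $\cos \theta$ and $\sin \theta$ will be helpful:

$$s(t) = f(t) \cos[\omega t + \phi(t)] \cos \theta + f(t) \sin[\omega t + \phi(t)] \sin \theta, \tag{78}$$

and

$$\frac{1}{N} \sum x_i s_i = \cos \theta\, \frac{1}{N} \sum x_i f(t_i) \cos[\omega t_i + \phi(t_i)] + \sin \theta\, \frac{1}{N} \sum x_i f(t_i) \sin[\omega t_i + \phi(t_i)].^{*} \tag{79}$$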


Because we wish to integrate with respect to $\theta$ to find the likelihood ratio, it is easiest to introduce parameters similar to polar coordinates $(r, \theta_0)$ such that

$$\frac{1}{N}\, r \cos \theta_0 = \frac{1}{N} \sum x_i f(t_i) \cos[\omega t_i + \phi(t_i)],$$
$$\frac{1}{N}\, r \sin \theta_0 = \frac{1}{N} \sum x_i f(t_i) \sin[\omega t_i + \phi(t_i)], \tag{80}$$

and therefore

$$\frac{1}{N} \sum x_i s_i = \frac{r}{N} \cos(\theta - \theta_0). \tag{81}$$

Using this form the likelihood ratio becomes

$$l(x) = \exp\left[-\frac{E}{N_0}\right] \int_0^{2\pi} \exp\left[\frac{r}{N} \cos(\theta - \theta_0)\right] \frac{d\theta}{2\pi} = \exp\left[-\frac{E}{N_0}\right] I_0\!\left(\frac{r}{N}\right), \tag{82}$$

where $I_0$ is the Bessel function of zero order and pure imaginary argument.

* $t_i$ denotes the $i$th sampling time, i.e., $t_i = i/2W$.

$I_0$ is a strictly monotone increasing function, and therefore the likelihood ratio will be greater than a value $\beta$ if and only if $r/N$ is greater than some value corresponding to $\beta$.
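The phase average that produces $I_0$ can be checked numerically. This is a small sketch, assuming NumPy's `np.i0` (the modified Bessel function of the first kind, order zero) as the reference; `phase_averaged_exponential` is an illustrative helper name and the values of `a` and `theta0` are arbitrary.

```python
import numpy as np


def phase_averaged_exponential(a: float, theta0: float, num: int = 4096) -> float:
    """Average exp[a * cos(theta - theta0)] over a uniform phase on [0, 2*pi)."""
    theta = np.linspace(0.0, 2.0 * np.pi, num, endpoint=False)
    return float(np.mean(np.exp(a * np.cos(theta - theta0))))


for a in (0.5, 2.0, 5.0):
    averaged = phase_averaged_exponential(a, theta0=0.7)
    print(a, averaged, float(np.i0(a)))
```

The result does not depend on `theta0`, which is why only $r/N$ survives in the likelihood ratio.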

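In the previous section it was shown that the sum $(1/N) \sum x_i s_i$ has a normal distribution with zero mean and variance $2E/N_0$ if the receiver input $x(t)$ is due to noise alone; $E$ is the energy of the signal known exactly, $s(t)$, and $N_0$ is the noise power per cycle. Since $f(t) \cos[\omega t + \phi(t)]$ and $f(t) \sin[\omega t + \phi(t)]$ are signals known exactly, both $(r/N) \cos \theta_0$ and $(r/N) \sin \theta_0$ have normal distributions with zero mean and variance $2E/N_0$. The probability that due to noise alone

$$\frac{r}{N} = \sqrt{\left(\frac{r}{N} \cos \theta_0\right)^2 + \left(\frac{r}{N} \sin \theta_0\right)^2}$$

will exceed any fixed value is given by the well known chi-square distribution for two degrees of freedom, $K_2(\alpha^2)$. The proper normalization yielding zero mean and unit variance requires that the variable be

$$\sqrt{\frac{N_0}{2E(s)}}\; \frac{r}{N},$$

that is

$$P_N\!\left(\sqrt{\frac{N_0}{2E}}\, \frac{r}{N} \ge \alpha\right) = K_2(\alpha^2) = \exp\left[-\frac{\alpha^2}{2}\right].^{*} \tag{83}$$

If $\alpha$ is defined by the equation

$$\beta = \exp\left[-\frac{E}{N_0}\right] I_0\!\left(\sqrt{\frac{2E}{N_0}}\; \alpha\right), \tag{84}$$

the distribution for $l(x)$ in the presence of noise alone is in the simple form

$$F_N(\beta) = \exp\left[-\frac{\alpha^2}{2}\right]. \tag{85}$$

It follows from (85) that

$$dF_N(\beta) = -\alpha \exp\left[-\frac{\alpha^2}{2}\right] d\alpha. \tag{86}$$

If in equation (68), namely

$$\beta\, dF_N(\beta) = dF_{SN}(\beta), \tag{87}$$

$\beta$ is replaced by the expression given in (84) and $dF_N(\beta)$ is replaced by that given in (86), then

$$dF_{SN}(\beta) = -\exp\left[-\frac{E}{N_0}\right] \alpha \exp\left[-\frac{\alpha^2}{2}\right] I_0\!\left(\sqrt{\frac{2E}{N_0}}\; \alpha\right) d\alpha \tag{88}$$

is obtained. Integration of (88) yields

$$F_{SN}(\beta) = \exp\left[-\frac{E}{N_0}\right] \int_{\alpha}^{\infty} \alpha \exp\left[-\frac{\alpha^2}{2}\right] I_0\!\left(\sqrt{\frac{2E}{N_0}}\; \alpha\right) d\alpha. \tag{89}$$

* The symbol $P(x \ge a)$ denotes the probability that the variable $x$ is not less than the constant $a$.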
Figure 4
Receiver  operating  characteristic.   Signal  known  except  for  phase. 

Eqs.  (85)  and  (89)  yield  the  receiver  operating  characteristic  in  parametric  form,  and 
Eq.  (84)  gives  the  associated  operating  levels  [15].  These  are  graphed  in  Fig.  4  for  some 
of  the  same  values  of  signal  energy  to  noise  power  per  unit  bandwidth  as  were  used 
when  the  phase  angle  was  known  exactly,  Figs.  2  and  3,  so  that  the  effect  of  knowing 
the  phase  can  be  easily  seen. 
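The parametric form of this operating characteristic lends itself to direct numerical evaluation. The sketch below assumes the noise-alone probability is $\exp[-\alpha^2/2]$ and that the signal-plus-noise probability is $\exp[-E/N_0]\int_\alpha^\infty a\,\exp[-a^2/2]\,I_0(\sqrt{2E/N_0}\,a)\,da$; it uses a plain trapezoidal rule, and the function name, integration cutoff, and grid size are illustrative choices suited only to moderate values of $E/N_0$.

```python
import numpy as np


def roc_point_unknown_phase(alpha: float, e_over_n0: float,
                            num: int = 200_001) -> tuple:
    """Return (F_N, F_SN) for a given alpha, signal known except for phase."""
    f_n = float(np.exp(-0.5 * alpha ** 2))

    c = np.sqrt(2.0 * e_over_n0)
    upper = max(alpha, c) + 10.0  # assumed cutoff; the integrand is negligible beyond it
    a = np.linspace(alpha, upper, num)
    integrand = a * np.exp(-0.5 * a ** 2 - e_over_n0) * np.i0(c * a)

    # Trapezoidal rule written out explicitly to avoid version-specific helpers.
    f_sn = float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(a)))
    return f_n, f_sn


if __name__ == "__main__":
    for alpha in (1.0, 2.0, 3.0):
        print(alpha, roc_point_unknown_phase(alpha, e_over_n0=4.5))
```

Comparing these points with those of the known-phase case at the same $2E/N_0$ shows the cost of not knowing the carrier phase.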

If the signal is sufficiently simple so that a filter could be synthesized to match the expected signal for a given carrier phase $\theta$ as in the case of a signal known exactly, then there is a simple way to design a receiver to obtain likelihood ratio. For simplicity let us consider only amplitude modulated signals [$\phi(t) = 0$] in Eq. (75). Let us also choose $\theta = 0$. (Any phase could have been chosen.) Then the filter has impulse response

$$h(t) = f(T - t) \cos[\omega(T - t)], \quad 0 \le t \le T; \qquad h(t) = 0, \quad \text{otherwise}. \tag{90}$$

The output of the filter in response to $x(t)$ is then

$$e_o(t) = \int_{-\infty}^{t} x(\tau)\, h(t - \tau)\, d\tau = \int_{t-T}^{t} x(\tau)\, f(\tau + T - t) \cos[\omega(\tau + T - t)]\, d\tau$$
$$= \cos \omega(T - t) \int_{t-T}^{t} x(\tau)\, f(\tau + T - t) \cos \omega\tau\, d\tau - \sin \omega(T - t) \int_{t-T}^{t} x(\tau)\, f(\tau + T - t) \sin \omega\tau\, d\tau. \tag{91}$$

The envelope of the filter output will be the square root of the sum of the squares of the integrals,* and the envelope at time $T$ will be proportional to $r/N$, since

$$\left(\frac{r}{2W}\right)^2 = \left[\int_0^T x(\tau)\, f(\tau) \cos \omega\tau\, d\tau\right]^2 + \left[\int_0^T x(\tau)\, f(\tau) \sin \omega\tau\, d\tau\right]^2, \tag{92}$$

which can be identified as the square of the envelope of $e_o(t)$ at time $T$. If the input $x(t)$ passes through the filter with an impulse response given by Eq. (90), then through a linear detector, the output will be $(N_0/2)\, r/N$ at time $T$. Because the likelihood ratio, Eq. (82), is a known monotone function of $r/N$, the output can be calibrated to read the likelihood ratio of the input.

4.6  Signal  consisting  of  a  sample  of  white  Gaussian  noise 

Suppose the values of the signal voltage at the sample points are independent Gaussian random variables with zero mean and variance $S$, the signal power. The probability density due to signal plus noise is also Gaussian, since signal plus noise is the sum of two Gaussian random variables:

$$f_{SN}(x) = \left[\frac{1}{2\pi(N + S)}\right]^{n/2} \exp\left[-\frac{1}{2}\, \frac{1}{N + S} \sum x_i^2\right], \tag{93}$$

where $n = 2WT$.

The likelihood ratio is

$$l(x) = \left[\frac{N}{N + S}\right]^{n/2} \exp\left[\frac{1}{2N} \sum x_i^2 - \frac{1}{2}\, \frac{1}{N + S} \sum x_i^2\right]. \tag{94}$$


In determining the distribution functions for $l$, it is convenient to introduce the parameter $\alpha$, defined by the equation

$$\beta = \left[\frac{N}{N + S}\right]^{n/2} \exp\left[\frac{\alpha^2}{2}\, \frac{S}{N + S}\right]. \tag{95}$$

Then the condition $l(x) \ge \beta$ is equivalent to the condition that $(1/N) \sum x_i^2 \ge \alpha^2$. In the presence of noise alone the random variables $x_i/\sqrt{N}$ have zero mean and unit variance, and they are independent. Therefore, the probability that the sum of the squares of these variables will exceed $\alpha^2$ is the chi-square distribution with $n$ degrees of freedom, i.e.,

$$F_N(\beta) = K_n(\alpha^2). \tag{96}$$

* If the line spectrum of $s(t)$ is zero at zero frequency and at all frequencies equal to or greater than $2\omega/2\pi$, then it can be shown that these integrals contain no frequencies as high as $\omega/2\pi$.
Similarly, in the presence of signal plus noise the random variables $x_i/\sqrt{N + S}$ have zero mean and unit variance. The condition $(1/N) \sum x_i^2 \ge \alpha^2$ is the same as requiring that $[1/(N + S)] \sum x_i^2 \ge [N/(N + S)]\, \alpha^2$, and again making use of the chi-square distribution,

$$F_{SN}(\beta) = K_n\!\left(\frac{N}{N + S}\, \alpha^2\right). \tag{97}$$

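For large values of $n$, the chi-square distribution is approximately normal over the center portion; more precisely [16], for $\alpha^2 > 0$,

$$F_N(\beta) = K_n(\alpha^2) \approx \frac{1}{\sqrt{2\pi}} \int_{\sqrt{2\alpha^2} - \sqrt{2n - 1}}^{\infty} \exp\left[-\frac{y^2}{2}\right] dy, \tag{98}$$

and

$$F_{SN}(\beta) = K_n\!\left(\frac{N}{N + S}\, \alpha^2\right) \approx \frac{1}{\sqrt{2\pi}} \int_{\sqrt{2N\alpha^2/(N + S)} - \sqrt{2n - 1}}^{\infty} \exp\left[-\frac{y^2}{2}\right] dy. \tag{99}$$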

If the signal energy is small compared to that of the noise, $\sqrt{N/(N + S)}$ is nearly unity and both distributions have nearly the same variance. Then Figs. 2 and 3 apply to this case too, with the value of $d$ given by

$$d = (2n - 1) \left[1 - \sqrt{\frac{N}{N + S}}\right]^2. \tag{100}$$


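For these small signal to noise ratios and large samples, there is a simple relation between signal to noise ratio, the number of samples, and the detection index $d$:

$$1 - \sqrt{\frac{N}{N + S}} \approx \frac{1}{2}\, \frac{S}{N}, \quad \text{and} \quad d \approx \frac{n}{2} \left(\frac{S}{N}\right)^2. \tag{101}$$

Two signal to noise ratios, $(S/N)_1$ and $(S/N)_2$, will give approximately the same operating characteristic if the corresponding numbers of sample points, $n_1$ and $n_2$, satisfy

$$\frac{n_1}{n_2} = \left[\frac{(S/N)_2}{(S/N)_1}\right]^2. \tag{102}$$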

By Eq. (94), the likelihood ratio is a monotone function of $\sum x_i^2$. But the output of an energy detector,

$$e_o(T) = \int_0^T [x(t)]^2\, dt = \frac{1}{2W} \sum x_i^2, \tag{103}$$

is proportional to $\sum x_i^2$. Therefore an energy detector can be calibrated to read likelihood ratio, and hence can be used as an optimum receiver in this case.
4.7  Video design of a broad band receiver

The problem considered in this section is represented schematically in Fig. 5. The signals and noise are assumed to have passed through a band pass filter, and at the output of the filter, point A on the diagram, they are assumed to be limited in spectrum to a band of width $W$ and center frequency $\omega/2\pi > W/2$. The noise is assumed to be Gaussian noise with a uniform spectrum over the band. The signals and noise then pass through a linear detector. The output of the detector is the envelope of the signals and noise as they appeared at point A; all knowledge of the phase of the receiver input is lost at point B. The signals and noise as they appear at point B are considered receiver inputs, and the theory of signal detectability is applied to these video inputs to ascertain the best video design and the performance of such a system. The mathematical description of the signals and noise will be given for the signals and noise as they appear at point A. The envelope functions, which appear at point B, will be derived, and the likelihood ratio and its distribution will be found for these envelope functions.

The  only  case  which  will  be  considered  here  is  the  case  in  which  the  amplitude  of 
the  signal  as  it  would  appear  at  point  A  is  a  known  function  of  time. 

Any function at point A will be band limited to a band of width $W$ and center frequency $\omega/2\pi > W/2$. Any such function $f(t)$ can be expanded as follows:

$$f(t) = x(t) \cos \omega t + y(t) \sin \omega t, \tag{105}$$

where $x(t)$ and $y(t)$ are band limited to frequencies no higher than $W/2$, and hence can themselves* be expanded by sampling plan C, yielding

$$f(t) = \sum_i \left[ x\!\left(\frac{i}{W}\right) \psi_i(t) \cos \omega t + y\!\left(\frac{i}{W}\right) \psi_i(t) \sin \omega t \right]. \tag{106}$$

The amplitude of the function $f(t)$ is

$$r(t) = \sqrt{[x(t)]^2 + [y(t)]^2}, \tag{107}$$

and thus the amplitude at the $i$th sampling point is

$$r_i = r\!\left(\frac{i}{W}\right) = \sqrt{x_i^2 + y_i^2}. \tag{108}$$

The angle

$$\theta_i = \arctan \frac{y_i}{x_i} = \arccos \frac{x_i}{r_i} \tag{109}$$

might be considered the phase of $f(t)$ at the $i$th sampling point. The function $f(t)$ then might be described by giving the $r_i$ and $\theta_i$ rather than the $x_i$ and $y_i$.

Input from antenna or mixer -> Band pass filter -> (Point A) -> Linear detector -> (Point B) -> Video amplifier

Figure 5
Block diagram of a broad band receiver.

* Because any function $f(t)$ at A has no frequency greater than $(\omega/2\pi) + (W/2)$, the usual sampling plan C might have been used on $f(t)$. However, the distribution in noise alone, $f_N(x)$, would probably not be applicable.
Let us denote by $x_i$, $y_i$, or $r_i$, $\theta_i$, the sample values for a receiver input after the filter (i.e., at the point A in Fig. 5). Let $a_i$, $b_i$, or $f_i$, $\phi_i$, denote the sample values for the signal as it would appear at point A if there were no noise. The envelope of the signal, hence the amplitude sample values $f_i$, are assumed known. Let us denote by $F_S(\phi_1, \phi_2, \ldots, \phi_{n/2})$ the distribution function of the phase sample values $\phi_i$. The probability density function for the input at A when there is white Gaussian noise and no signal, with $n = 2WT$, is

$$f_N(x, y) = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\left[-\frac{1}{2N} \left(\sum_{i=1}^{n/2} x_i^2 + \sum_{i=1}^{n/2} y_i^2\right)\right], \tag{110}$$

and for signal plus noise, it is

$$f_{SN}(x, y) = \left(\frac{1}{2\pi N}\right)^{n/2} \int \exp\left[-\frac{1}{2N} \left(\sum (x_i - a_i)^2 + \sum (y_i - b_i)^2\right)\right] dP_S(a, b). \tag{111}$$

Expressed in terms of the $(r, \theta)$ sample values, Eq. (110) and Eq. (111) become

$$f_N(r, \theta) = \left(\frac{1}{2\pi N}\right)^{n/2} \prod_{i=1}^{n/2} r_i\; \exp\left[-\frac{1}{2N} \sum_{i=1}^{n/2} r_i^2\right], \tag{112}$$

and

$$f_{SN}(r, \theta) = \left(\frac{1}{2\pi N}\right)^{n/2} \prod_{i=1}^{n/2} r_i \int \exp\left[-\frac{1}{2N} \sum_{i=1}^{n/2} \left[r_i^2 + f_i^2 - 2 r_i f_i \cos(\theta_i - \phi_i)\right]\right] dF_S(\phi_1, \ldots, \phi_{n/2}). \tag{113}$$
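The factors $\prod r_i$ are introduced because they are the Jacobian of the transformation from the $x, y$ sampling plan to the $r, \theta$ sampling plan [16].*

The probability density function for $r$ alone, i.e., the density function for the output of the detector, is obtained simply by integrating the density functions for $r$ and $\theta$ with respect to $\theta$:

$$f_N(r) = \int_0^{2\pi} \int_0^{2\pi} \cdots \int_0^{2\pi} f_N(r, \theta)\, d\theta_1\, d\theta_2 \cdots d\theta_{n/2},$$

or

$$f_N(r) = \left(\frac{1}{N}\right)^{n/2} \prod_{i=1}^{n/2} r_i\; \exp\left[-\frac{1}{2N} \sum_{i=1}^{n/2} r_i^2\right], \tag{114}$$

and

$$f_{SN}(r) = \int_0^{2\pi} \int_0^{2\pi} \cdots \int_0^{2\pi} f_{SN}(r, \theta)\, d\theta_1\, d\theta_2 \cdots d\theta_{n/2}$$
$$= \left(\frac{1}{N}\right)^{n/2} \prod_{i=1}^{n/2} r_i\; \exp\left[-\frac{1}{2N} \sum_{i=1}^{n/2} (r_i^2 + f_i^2)\right] \int \prod_{i=1}^{n/2} I_0\!\left(\frac{r_i f_i}{N}\right) dF_S(\phi_1, \phi_2, \ldots, \phi_{n/2}),$$

or

$$f_{SN}(r) = \left(\frac{1}{N}\right)^{n/2} \prod_{i=1}^{n/2} r_i\; \exp\left[-\frac{1}{2N} \sum_{i=1}^{n/2} (r_i^2 + f_i^2)\right] \prod_{i=1}^{n/2} I_0\!\left(\frac{r_i f_i}{N}\right). \tag{115}$$

* For example, in two dimensions, $f(x, y)\, dx\, dy = f(r, \theta)\, r\, dr\, d\theta$.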
Notice that the probability density for $r$ is completely independent of the distribution which the $\phi_i$ had; all information about the phase of the signals has been lost.

The likelihood ratio for a video input, $r(t)$, is

$$l(r) = \exp\left[-\frac{1}{2N} \sum_{i=1}^{n/2} f_i^2\right] \prod_{i=1}^{n/2} I_0\!\left(\frac{r_i f_i}{N}\right). \tag{116}$$

Again it is more convenient to work with the logarithm of the likelihood ratio. Thus,

$$\ln l[r(t)] = -\frac{E}{N_0} + \sum_{i=1}^{n/2} \ln I_0\!\left(\frac{r_i f_i}{N}\right), \tag{118}$$

which is approximately

$$\ln l[r(t)] = -\frac{E}{N_0} + W \int_0^T \ln I_0\!\left[\frac{r(t)\, f(t)}{N}\right] dt. \tag{119}$$

The function $\ln I_0(x)$ is approximately the parabola $x^2/4$ for small values of $x$ and is nearly linear for large values of $x$. Thus, the expression for likelihood ratio might be approximated by

$$\ln l[r(t)] = -\frac{E}{N_0} + \frac{W}{4N^2} \int_0^T [r(t)]^2 [f(t)]^2\, dt \tag{120}$$

for small signals, and by

$$\ln l[r(t)] = C_1 + C_2 \int_0^T r(t)\, f(t)\, dt \tag{121}$$

for large signals, where $C_1$ and $C_2$ are chosen to approximate $\ln I_0$ best in the desired range.

The integrals in Eqs. (120) and (121) can be interpreted as cross correlations. Thus the optimum receiver for weak signals is a square law detector, followed by a correlator which finds the cross correlation between the detector output and $[f(t)]^2$, the square of the envelope of the expected signal. For the case of large signal to noise ratio, the optimum receiver is a linear detector, followed by a correlator which has for its output the cross correlation of the detector output and $f(t)$, the amplitude of the expected signal.
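The two regimes of $\ln I_0(x)$ that motivate the square law and linear detectors can be seen numerically. This is a small sketch using NumPy's `np.i0`; the sample points are arbitrary and `log_i0` is an illustrative helper name.

```python
import numpy as np


def log_i0(x: np.ndarray) -> np.ndarray:
    """Natural logarithm of the modified Bessel function I_0."""
    return np.log(np.i0(x))


# Small arguments: ln I_0(x) should be close to the parabola x**2 / 4.
small = np.array([0.05, 0.1, 0.2, 0.5])
parabola_ratio = log_i0(small) / (small ** 2 / 4.0)
print(parabola_ratio)

# Large arguments: ln I_0(x) should be nearly linear, with slope near 1.
large = np.array([10.0, 20.0, 40.0])
slopes = np.diff(log_i0(large)) / np.diff(large)
print(slopes)
```

The ratios drift away from 1 as `x` grows, which marks where the weak-signal (square law) approximation stops being adequate.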

The distribution function for $l(r)$ cannot be found easily in this case. The approximation developed here will apply to the receiver designed for low signal to noise ratio, since this is the case of most interest in detection studies. An analogous approximation for the large signal to noise ratios would be even easier to derive.

First we shall find the mean and standard deviation for the distribution of the logarithm of the likelihood ratio as shown above,

$$\ln l(r) = -\frac{E}{N_0} + \sum_{i=1}^{n/2} \frac{r_i^2 f_i^2}{4N^2}, \tag{122}$$

for the case of small signal to noise ratio. The probability density functions for each $r_i$ are

$$g_{SN}(r_i) = \frac{r_i}{N} \exp\left[-\frac{r_i^2 + f_i^2}{2N}\right] I_0\!\left(\frac{r_i f_i}{N}\right)$$

and

$$g_N(r_i) = \frac{r_i}{N} \exp\left[-\frac{r_i^2}{2N}\right]. \tag{123}$$


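The notation $g_N(r_i)$ and $g_{SN}(r_i)$ is used to distinguish these from the joint distributions of all the $r_i$ which were previously called $f_N(r)$ and $f_{SN}(r)$. The mean of each term $r_i^2 f_i^2/4N^2$ in the sum in Eq. (122) is

$$\mu_{SN}\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \int_0^{\infty} \frac{r_i^2 f_i^2}{4N^2}\, g_{SN}(r_i)\, dr_i,$$

or

$$\mu_{SN}\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^2}{4N^2} \int_0^{\infty} \frac{r_i^3}{N} \exp\left[-\frac{r_i^2 + f_i^2}{2N}\right] I_0\!\left(\frac{r_i f_i}{N}\right) dr_i.$$

Similarly,

$$\mu_N\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^2}{4N^2} \int_0^{\infty} \frac{r_i^3}{N} \exp\left[-\frac{r_i^2}{2N}\right] dr_i. \tag{124}$$

The second moment of each term $r_i^2 f_i^2/4N^2$ is

$$\mu_{SN}\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \int_0^{\infty} \frac{r_i^4 f_i^4}{16N^4}\, g_{SN}(r_i)\, dr_i,$$

or

$$\mu_{SN}\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \frac{f_i^4}{16N^4} \int_0^{\infty} \frac{r_i^5}{N} \exp\left[-\frac{r_i^2 + f_i^2}{2N}\right] I_0\!\left(\frac{r_i f_i}{N}\right) dr_i.$$

Similarly,

$$\mu_N\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \frac{f_i^4}{16N^4} \int_0^{\infty} \frac{r_i^5}{N} \exp\left[-\frac{r_i^2}{2N}\right] dr_i. \tag{125}$$

The integrals for the case of noise alone can be evaluated easily:

$$\mu_N\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^2}{2N}, \quad \text{and} \quad \mu_N\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \frac{f_i^4}{2N^2}. \tag{126}$$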


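The integrals for the case of signal plus noise can be evaluated in terms of the confluent hypergeometric function, which turns out for the cases above to reduce to a simple polynomial. The required formulas are collected in convenient form in Threshold Signals [5] on page 174. The results are

$$\mu_{SN}\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^2}{2N} \left(1 + \frac{f_i^2}{2N}\right),$$

and

$$\mu_{SN}\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \frac{f_i^4}{2N^2} \left(1 + \frac{f_i^2}{N} + \frac{f_i^4}{8N^2}\right). \tag{127}$$

Since

$$\sigma^2(z) = \mu(z^2) - [\mu(z)]^2, \tag{128}$$

the variances of $r_i^2 f_i^2/4N^2$ are

$$\sigma_{SN}^2\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^4}{4N^2} \left(1 + \frac{f_i^2}{N}\right),$$

and

$$\sigma_N^2\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^4}{4N^2}. \tag{129}$$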
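The moments of a single term $r_i^2 f_i^2/4N^2$ can be checked by simulation. The sketch below assumes the envelope $r_i$ is formed from two independent Gaussian quadrature components of variance $N$, with the signal amplitude $f_i$ added to one of them; under that assumption the term should have mean $f^2/2N$ and variance $f^4/4N^2$ with noise alone, and mean $(f^2/2N)(1 + f^2/2N)$ and variance $(f^4/4N^2)(1 + f^2/N)$ with signal plus noise. The values of `N`, `f`, and the trial count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1.5          # noise power (assumed value)
f = 0.8          # signal envelope sample (assumed value)
trials = 400_000

# Quadrature noise components, each with variance N.
x = rng.normal(scale=np.sqrt(N), size=trials)
y = rng.normal(scale=np.sqrt(N), size=trials)

r_noise = np.hypot(x, y)          # envelope with noise alone
r_signal = np.hypot(x + f, y)     # envelope with signal plus noise

term_noise = r_noise ** 2 * f ** 2 / (4.0 * N ** 2)
term_signal = r_signal ** 2 * f ** 2 / (4.0 * N ** 2)

mean_n_theory = f ** 2 / (2.0 * N)
mean_sn_theory = mean_n_theory * (1.0 + f ** 2 / (2.0 * N))
var_n_theory = f ** 4 / (4.0 * N ** 2)
var_sn_theory = var_n_theory * (1.0 + f ** 2 / N)

print(term_noise.mean(), mean_n_theory)
print(term_signal.mean(), mean_sn_theory)
print(term_noise.var(), var_n_theory)
print(term_signal.var(), var_sn_theory)
```

Agreement to within sampling error supports the polynomial forms without evaluating the confluent hypergeometric function directly.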

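For the sum of independent random variables, the mean is the sum of the means of the terms and the variance is the sum of the variances. Therefore the means of $\ln l(r)$ are

$$\mu_{SN}[\ln l(r)] = -\frac{1}{2N} \sum_{i=1}^{n/2} f_i^2 + \sum_{i=1}^{n/2} \left(\frac{f_i^2}{2N} + \frac{f_i^4}{4N^2}\right) = \sum_{i=1}^{n/2} \frac{f_i^4}{4N^2},$$

and

$$\mu_N[\ln l(r)] = -\sum_{i=1}^{n/2} \frac{f_i^2}{2N} + \sum_{i=1}^{n/2} \frac{f_i^2}{2N} = 0, \tag{130}$$

and the variances of $\ln l(r)$ are

$$\sigma_{SN}^2[\ln l(r)] = \sum_{i=1}^{n/2} \left(\frac{f_i^4}{4N^2} + \frac{f_i^6}{4N^3}\right),$$

and

$$\sigma_N^2[\ln l(r)] = \sum_{i=1}^{n/2} \frac{f_i^4}{4N^2}. \tag{131}$$

If the distribution functions of $\ln l(r)$ can be assumed to be normal, they can be obtained immediately from the mean and standard deviation of the logarithm of likelihood ratio.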

Let us consider the case in which the incoming signal is a rectangular pulse which is $M/W$ seconds long.* The energy of the pulse is half its duration times the amplitude squared of its envelope, for a normalized circuit impedance of one ohm. Thus of the $WT$ numbers $\{f_i\}$, there are $M$ consecutive ones which are not zero. These are given by

$$f_i^2 = \frac{2WE}{M}, \tag{132}$$

where $E$ is the pulse energy at point A in Fig. 5 in the absence of noise. For this case, Eq. (130) and Eq. (131) become

$$\mu_{SN}[\ln l(r)] = \frac{1}{M}\, \frac{E^2}{N_0^2},$$
$$\mu_N[\ln l(r)] = 0,$$
$$\sigma_{SN}^2[\ln l(r)] = \frac{E^2}{M N_0^2} \left(1 + \frac{2E}{M N_0}\right), \tag{133}$$

and

$$\sigma_N^2[\ln l(r)] = \frac{E^2}{M N_0^2}.$$

* The problem of finding the distribution for the sum of $M$ independent random variables, each with a probability density function $f(x) = x \exp[-(\tfrac{1}{2})(x^2 + a^2)]\, I_0(ax)$, arises in the unpublished report by J. I. Marcum, A Statistical Theory of Target Detection by Pulsed Radar: Mathematical Appendix, Project Rand Report R-113. Marcum gives an exact expression for this distribution which is useful only for small values of $M$, and an approximation in Gram-Charlier series which is more accurate than the normal approximation given here. Marcum's expressions could be used in this case, and in the case presented in Section 4.6.

The distribution of $\ln l(r)$ is approximately normal if $M$ is much larger than one, for, by the central limit theorem, the distribution of a sum of $M$ independent random variables with a common distribution must approach the normal distribution as $M$ becomes large. The actual distribution for the case of noise alone can be calculated in
this  case,  since  the  convolution  integral  for  Xh^gj^ixi)  with  itself  any  number  of  times 
can be expressed in closed form. The distribution of $\ln l(r)$ for signal plus noise is more nearly normal than its distribution with noise alone, since the distributions $g_{SN}(r_i)$ are more nearly normal than $g_N(r_i)$.
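Under the normal approximation, an operating characteristic follows directly from the two means and variances of $\ln l(r)$. The sketch below assumes, for the rectangular pulse case, a noise-alone mean of $0$ and variance $E^2/(M N_0^2)$, and a signal-plus-noise mean of $E^2/(M N_0^2)$ and variance $(E^2/(M N_0^2))(1 + 2E/(M N_0))$; the function names and the numerical values are illustrative only.

```python
import math


def q_function(z: float) -> float:
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def video_roc_point(threshold: float, e_over_n0: float, m: int) -> tuple:
    """Return (F_N, F_SN) for a threshold on ln l(r), normal approximation."""
    mean_sn = e_over_n0 ** 2 / m
    var_n = e_over_n0 ** 2 / m
    var_sn = var_n * (1.0 + 2.0 * e_over_n0 / m)

    f_n = q_function(threshold / math.sqrt(var_n))
    f_sn = q_function((threshold - mean_sn) / math.sqrt(var_sn))
    return f_n, f_sn


if __name__ == "__main__":
    m, e_over_n0 = 16, 8.0  # hypothetical pulse length in samples and E/N0
    for threshold in (1.0, 2.0, 4.0, 6.0):
        print(threshold, video_roc_point(threshold, e_over_n0, m))
```

Because the two variances differ, the resulting curve is not exactly one of the equal-variance family unless $2E/(M N_0)$ is small.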

The receiver operating characteristic for the case $M = 16$ is plotted in Fig. 6 using the normal distribution as approximation to the true distribution. In many cases it will be found that

$$\frac{1}{M}\, \frac{2E}{N_0} \ll 1. \tag{134}$$

In such a case the distributions have approximately the same variance. Assuming normal distribution then leads to the curves of Figs. 2 and 3, with

$$d = \frac{1}{M}\, \frac{E^2}{N_0^2}. \tag{135}$$

4.8  A radar case

This section deals with detecting a radar target at a given range. That is, we shall assume that the signal, if it occurs, consists of a train of $M$ pulses whose time of occurrence and envelope shape are known. The carrier phase will be assumed to have a uniform distribution for each pulse independent of all others, i.e., the pulses are incoherent.

The set of signals can be described as follows:

$$s(t) = \sum_{m=0}^{M-1} f(t + m\tau) \cos(\omega t + \theta_m), \tag{136}$$

where the $M$ angles $\theta_m$ have independent uniform distributions, and the function $f$, which is the envelope of a single pulse, has the property that

$$\int_0^T f(t + i\tau)\, f(t + j\tau)\, dt = \frac{2E}{M}\, \delta_{ij}, \tag{137}$$

where $\delta_{ij}$ is the Kronecker delta function, which is zero if $i \ne j$, and unity if $i = j$. The time $\tau$ is the interval between pulses. Eq. (137) states that the pulses are spaced far enough so that they are orthogonal, and that the total signal energy is $E$.* The function $f(t)$ is also assumed to have no frequency components as high as $\omega/2\pi$.

Figure 6
Receiver operating characteristic. Broad band receiver with optimum video design, $M = 16$.

* The factor 2 appears in (137) because $f(t)$ is the pulse envelope; the factor $M$ appears because the total energy $E$ is $M$ times the energy of a single pulse.
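The likelihood ratio can be obtained by applying Eq. (56). Then

$$l(x) = \int \exp\left[-\frac{E}{N_0}\right] \exp\left[\frac{2}{N_0} \int_0^T s(t)\, x(t)\, dt\right] dP_S(s), \tag{138}$$

or

$$l(x) = \exp\left[-\frac{E}{N_0}\right] \int_0^{2\pi} \cdots \int_0^{2\pi} \exp\left[\frac{2}{N_0} \int_0^T \sum_{m=0}^{M-1} f(t + m\tau)\, x(t) \cos(\omega t + \theta_m)\, dt\right] \frac{d\theta_0}{2\pi} \cdots \frac{d\theta_{M-1}}{2\pi}. \tag{139}$$

The integral can be evaluated, as in Section 4.5, yielding

$$l(x) = \exp\left[-\frac{E}{N_0}\right] \prod_{m=0}^{M-1} I_0\!\left(\frac{r_m}{N}\right), \tag{140}$$

where

$$\frac{r_m}{N} = \frac{2}{N_0} \left\{ \left[\int_0^T f(t + m\tau)\, x(t) \cos \omega t\, dt\right]^2 + \left[\int_0^T f(t + m\tau)\, x(t) \sin \omega t\, dt\right]^2 \right\}^{1/2}. \tag{141}$$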

This quantity $r_m$ is almost identical with the quantity $r$ which appeared in the discussion of the case of the signal known except for carrier phase, Section 4.5. In fact, each $r_m$ could be obtained in a receiver in the manner described in that section. The quantity $r_0$ is connected with the first pulse; it could be obtained by designing an ideal filter for the signal

$$s_0(t) = f(t) \cos(\omega t + \theta) \tag{142}$$

for any value of the phase angle $\theta$, and putting the output through a linear detector. The output will be $(N_0/2)\, r_0/N$ at some instant of time $t_0$ which is determined by the time delay of the filter. The other quantities $r_m$ differ only in that they are associated with the pulses which come later. The output of the filter at time $t_0 + m\tau$ will be $(N_0/2)\, r_m/N$.

It is convenient to have the receiver calculate the logarithm of the likelihood ratio,

$$\ln l(x) = -\frac{E}{N_0} + \sum_{m=0}^{M-1} \ln I_0\!\left(\frac{r_m}{N}\right). \tag{143}$$

Thus the $\ln I_0(r_m/N)$ must be found for each $r_m$, and these $M$ quantities must be added. As in the previous section, $r_m/N$ will usually be small enough so that $\ln I_0(x)$ can be approximated by $x^2/4$. The quantities $\tfrac{1}{4}(r_m/N)^2$ can be found by using a square law detector rather than a linear detector, and the outputs of the square law detector at times $t_0, t_0 + \tau, \ldots, t_0 + (M - 1)\tau$ then must be added. The ideal system thus consists of an IF amplifier with its passband matched to a single pulse,* a square law detector (for the threshold signal case), and an integrating device.

We shall find normal approximations for the distribution functions of the logarithm of the likelihood ratio using the approximation

$$\ln I_0\!\left(\frac{r_m}{N}\right) \approx \frac{r_m^2}{4N^2}, \tag{144}$$

* It is usually most convenient to make the ideal filter (or an approximation to it) a part of the IF amplifier.

which is valid for small values of $r_m/N$.* Substitution of (144) into (143) yields

$$\ln l \approx -\frac{E}{N_0} + \frac{1}{4} \sum_{m=0}^{M-1} \frac{r_m^2}{N^2}. \tag{145}$$

The distributions for the quantities $r_m$ are independent; this follows from the fact that the individual pulse functions $f(t + m\tau) \cos(\omega t + \theta_m)$ are orthogonal. The distribution for each is the same as the distribution for the quantity $r$ which appears in the discussion of the signal known except for phase; the same analysis applies to both cases. Thus, by Eq. (83)†

The density functions can be obtained by differentiating (146) and (147):

(148)


This  is  the  same  situation,  mathematically,  as  appeared  in  the  previous  section.  The 
standard  deviation  and  the  mean  for  the  logarithm  of  the  likelihood  ratio  can  be  found 
in  the  same  manner,  and  they  are 


MNl 


/i^v(ln  /)  =  0, 


a|^-(ln/)  = 


£2 

MNl 


1  + 


2E 

MM 


(149) 


and 


4(ln  /)  = 


MN^ 


If the distributions can be assumed normal, they are completely determined by
their means and variances. These formulas are identical with the formulas (133) of the
previous section. The problem is the same, mathematically, and the discussion and
receiver operating characteristic curves at the end of Section 4.7 apply to both cases.

* See the footnote below equation (131).

† The $M$ appears in the following equations because the energy of a single pulse is $E/M$
rather than $E$.
4.9  Approximate evaluation of an optimum receiver

In order to obtain approximate results for the remaining two cases, the assumption
is made that in these cases the receiver operating characteristic can be approximated
by the curves of Figs. 2 and 3, i.e., that the logarithm of the likelihood ratio is
approximately normal. This section discusses the approximation and a method for
fitting the receiver operating characteristic to the curves of Figs. 2 and 3.

By (68), $F_{SN}(l)$ can be calculated if $F_N(l)$ is known. Furthermore, it can be seen
that the $n$th moment of the distribution $F_N(l)$ is the $(n-1)$th moment of the distribution
$F_{SN}(l)$. Hence, the mean of the likelihood ratio with noise alone is unity, and if
the variance of the likelihood ratio with noise alone is $\sigma_N^2$, the second moment with
noise alone, and hence the mean with signal plus noise, is $1 + \sigma_N^2$. Thus the difference
between the means is equal to $\sigma_N^2$, which is the variance of the likelihood ratio with
noise alone. Probably this number characterizes ability to detect signals better than
any other single number.

Suppose the logarithm of the likelihood ratio has a normal distribution with
noise alone, i.e.,

$$F_N(l) = \frac{1}{\sqrt{2\pi d}} \int_{\ln l}^{\infty} \exp\left[-\frac{(x - m)^2}{2d}\right] dx, \qquad (150)$$

where $m$ is the mean and $d$ the variance of the logarithm of the likelihood ratio. The
$n$th moment of the likelihood ratio can be found as follows:

$$\int_0^{\infty} l^n \, dF_N(l) = \frac{1}{\sqrt{2\pi d}} \int_{-\infty}^{\infty} \exp[nx] \exp\left[-\frac{(x - m)^2}{2d}\right] dx, \qquad (151)$$


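where the substitution $l = \exp x$ has been made. The integral can be evaluated by
completing the square in the exponent and using the fact that

$$\int_{-\infty}^{\infty} \exp\left[-\frac{(x - a)^2}{2d}\right] dx = \sqrt{2\pi d}.$$

Thus,

$$\mu_N(l^n) = \exp\left[\frac{n^2 d}{2} + mn\right]. \qquad (152)$$

In particular, the mean of $l(x)$, which must be unity, is

$$\mu_N(l) = 1 = \exp\left[\frac{d}{2} + m\right], \qquad (153)$$

and therefore

$$m = -\frac{d}{2}. \qquad (154)$$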


The variance of $l(x)$ with noise alone is $\sigma_N^2$, and therefore the second moment of $l(x)$ is

$$\mu_N(l^2) = [\mu_N(l)]^2 + \sigma_N^2(l) = 1 + \sigma_N^2(l), \qquad (155)$$

and this must agree with (152). It follows that

$$\mu_N(l^2) = 1 + \sigma_N^2 = \exp[2d + 2m] = \exp[d], \qquad (156)$$

and therefore

$$d = \ln(1 + \sigma_N^2). \qquad (157)$$
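
The moment formula (152) and the resulting relations (154) and (157) can be checked numerically. The sketch below is illustrative only; the function names and the trapezoidal integration settings are assumptions, not part of the original text. It compares the closed form $\exp[n^2 d/2 + mn]$ with a direct numerical evaluation of the integral in (151).

```python
import math

def moment_closed_form(n, m, d):
    """n-th moment of l = exp(x), x normal with mean m and variance d (Eq. 152)."""
    return math.exp(n * n * d / 2.0 + m * n)

def moment_numeric(n, m, d, steps=20000, half_width=12.0):
    """Trapezoidal estimate of the right-hand integral in Eq. (151)."""
    s = math.sqrt(d)
    center = m + n * d                      # peak of exp(n x) times the normal density
    lo = center - half_width * s
    h = 2.0 * half_width * s / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(n * x - (x - m) ** 2 / (2.0 * d))
    return total * h / math.sqrt(2.0 * math.pi * d)

d = 1.5                 # assumed variance of ln l, chosen only for illustration
m = -d / 2.0            # Eq. (154)
sigma2 = moment_closed_form(2, m, d) - 1.0
print(moment_numeric(1, m, d), moment_numeric(2, m, d), math.log(1.0 + sigma2))
```

With $m = -d/2$ the first moment comes out as unity and the second as $e^d$, so $\ln(1 + \sigma_N^2)$ recovers $d$.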

The distribution of likelihood ratio with signal plus noise can be found by
applying Eq. (68). Thus

$$dF_{SN}(l) = l \, dF_N(l), \qquad F_{SN}(l) = \int_l^{\infty} l \, dF_N(l). \qquad (158)$$

If $dF_N(l)$ is obtained from Eq. (150) and $l$ is replaced by $\exp x$, then

$$F_{SN}(l) = \frac{1}{\sqrt{2\pi d}} \int_{\ln l}^{\infty} \exp[x] \exp\left[-\frac{(x + d/2)^2}{2d}\right] dx
= \frac{1}{\sqrt{2\pi d}} \int_{\ln l}^{\infty} \exp\left[-\frac{(x - d/2)^2}{2d}\right] dx. \qquad (159)$$

Thus the distribution of $\ln l$ is normal also when there is signal plus noise, in this case
with mean $d/2$ and variance $d$.

In summary, it is probable that the variance $\sigma_N^2$ of the likelihood ratio measures
ability to detect signals better than any other single number. If the logarithm of
likelihood ratio has a normal distribution with noise alone, then this distribution and
that with signal plus noise are completely determined if $\sigma_N^2$ is given. The distribution of
$\ln l(x)$ is normal in both cases. Its variance in both cases is $d$, which is also the difference
of the means. The receiver operating characteristic curves are those plotted in Fig. 2,
with the parameter $d$ related to $\sigma_N^2$ by the equation

$$d = \ln(1 + \sigma_N^2). \qquad (160)$$

In the case of a signal known exactly, this is the distribution which occurs. In
the cases of Section 4.6, Section 4.7, and Section 4.8 this distribution is found to be the
limiting distribution when the number of sample points is large. Certainly in most
cases the distribution has this general form. Thus it seems reasonable that useful
approximate results could be obtained by calculating only $\sigma_N^2$ for a given case and
assuming that the ability to detect signals is approximately the same as if the logarithm
of the likelihood ratio has a normal distribution. On this basis, $\sigma_N^2(l)$ is calculated
in the following sections for two cases, and the assertion is made that the receiver
operating characteristic curves are approximated by those of Fig. 2 with $d = \ln(1 + \sigma_N^2)$.
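
Under the normal approximation of this section, a point on the receiver operating characteristic follows directly from $d$: $\ln l$ has mean $-d/2$ with noise alone, mean $+d/2$ with signal plus noise, and variance $d$ in both cases. The sketch below is an illustration under that assumption only; the function names and the example value of $\sigma_N^2$ are hypothetical.

```python
import math

def q_function(z):
    """Upper tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def detection_index(sigma2_n):
    """d = ln(1 + sigma_N^2), as in Eq. (160)."""
    return math.log1p(sigma2_n)

def roc_point(d, log_threshold):
    """(false alarm probability, detection probability) when ln l is
    normal with variance d and means -d/2 (noise) and +d/2 (signal plus noise)."""
    s = math.sqrt(d)
    p_false_alarm = q_function((log_threshold + d / 2.0) / s)
    p_detection = q_function((log_threshold - d / 2.0) / s)
    return p_false_alarm, p_detection

d = detection_index(3.0)                 # assumed variance of the likelihood ratio
for t in (-1.0, 0.0, 1.0):               # thresholds on ln l
    print(t, roc_point(d, t))
```

Sweeping the threshold on $\ln l$ traces out one curve of the family; a larger $d$ moves the curve toward the upper left corner.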

4.10  Signal  which  is  one  of  M  orthogonal  signals 

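Suppose that the set of expected signals includes just $M$ functions $s_k(t)$, all of
which have the same probability, the same energy $E$, and are orthogonal. That is,

$$\int_0^T s_k(t)\, s_q(t)\, dt = E\,\delta_{kq}. \qquad (161)$$

Then the likelihood ratio can be found from Eq. (56) to be

$$l(x) = \sum_{k=1}^{M} \frac{1}{M} \exp\left[-\frac{E}{N_0}\right] \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki}\right]
= \frac{1}{M}\sum_{k=1}^{M} \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki} - \frac{E}{N_0}\right], \qquad (162)$$

where $s_{ki}$ are the sample values of the function $s_k(t)$.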
With noise alone, each term of the form

$$\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki}$$

has a normal distribution with mean zero and variance

$$\sum_{i=1}^{n} \frac{s_{ki}^2}{N} = \frac{2E}{N_0}.$$

Furthermore, the $M$ different quantities

$$\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki}$$

are independent, since the functions $s_k(t)$ are orthogonal. It follows that the terms

$$\exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki} - \frac{E}{N_0}\right]$$

are independent.

Since the logarithm of each term

$$Z = \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki} - \frac{E}{N_0}\right]$$

has a normal distribution with mean $-E/N_0$ and variance $2E/N_0$, the moments of the
distribution can be found from Eq. (152). The $n$th moment is

$$\mu_N(Z^n) = \exp\left[n(n-1)\frac{E}{N_0}\right]. \qquad (163)$$

It follows that the mean of each term is unity, and the variance is

$$\sigma_N^2(Z) = \mu_N(Z^2) - [\mu_N(Z)]^2 = \exp\left[\frac{2E}{N_0}\right] - 1. \qquad (164)$$


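The variance of a sum of independent random variables is the sum of the variances of
the terms. Therefore

$$\sigma_N^2(Ml) = M\left[\exp\left(\frac{2E}{N_0}\right) - 1\right], \qquad (165)$$

and it follows that the variance of the likelihood ratio is

$$\sigma_N^2(l) = \frac{1}{M}\left[\exp\left(\frac{2E}{N_0}\right) - 1\right]. \qquad (166)$$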


It was pointed out in Section 4.9 that the receiver operating characteristic
curves are approximately those of Fig. 2, with

$$d = \ln(1 + \sigma_N^2) = \ln\left[1 - \frac{1}{M} + \frac{1}{M}\exp\left(\frac{2E}{N_0}\right)\right]. \qquad (167)$$

This equation can be solved for $2E/N_0$:

$$\frac{2E}{N_0} = \ln[1 + M(e^{d} - 1)]. \qquad (168)$$

* The reasoning is the same as that in Section 4.4.
Suppose it is desired to keep the false alarm probability and probability of detection
constant. This requires that $d$ be kept constant. Then from Eq. (168) it can be
seen that if the number of possible signals $M$ is increased, the signal energy $E$ must also
be increased.
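
The trade-off between the number of signals and the required energy can be tabulated from Eqs. (167) and (168). The following is a small sketch for illustration; the function names and the chosen value of $d$ are assumptions and do not come from the original text.

```python
import math

def detection_index(M, energy_ratio):
    """Eq. (167): d as a function of 2E/N0 for one of M orthogonal signals."""
    return math.log(1.0 - 1.0 / M + math.exp(energy_ratio) / M)

def required_energy_ratio(M, d):
    """Eq. (168): the value of 2E/N0 that yields detection index d."""
    return math.log(1.0 + M * math.expm1(d))

d = 4.0                                   # assumed detection index, held constant
for M in (1, 10, 100, 1000):
    print(M, round(required_energy_ratio(M, d), 3))
```

For $M = 1$ the required $2E/N_0$ equals $d$ itself, and it grows steadily as $M$ increases while $d$ is held fixed.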

4.11  Signal which is one of M orthogonal signals with unknown carrier phase

Consider the case in which the set of expected signals includes just $M$ different
amplitude-modulated signals which are known except for carrier phase. Denote the
signals by

$$s_k(t) = f_k(t)\cos(\omega t + \theta). \qquad (169)$$

It will be assumed further that the functions $f_k(t)$ all have the same energy $E$ and are
orthogonal, i.e.,

$$\int_0^T f_k(t)\, f_q(t)\, dt = 2E\,\delta_{kq}, \qquad (170)$$

where the 2 is introduced because the $f$'s are the signal amplitudes, not the actual signal
functions. Also, let the $f_k(t)$ be band-limited to contain no frequencies as high as $\omega$.
Then it follows that any two signal functions with different envelope functions will be
orthogonal. Let us assume also that the distribution of phase, $\theta$, is uniform, and that
the probability for each envelope function is $1/M$.

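With these assumptions, the likelihood ratio can be obtained from Eq. (56), and
it is given by

$$l(x) = \frac{1}{M}\sum_{k=1}^{M} \frac{1}{2\pi}\int_0^{2\pi} \exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ki} - \frac{E}{N_0}\right] d\theta, \qquad (171)$$

where $s_{ki}$ are the sample values of $s_k(t)$, and hence depend upon the phase $\theta$. The
integration is the same as in the case of the signal known except for phase, and the
result, obtained from Eq. (82), is

$$l(x) = \frac{1}{M}\sum_{k=1}^{M} \exp\left[-\frac{E}{N_0}\right] I_0\!\left(\frac{r_k}{N}\right), \qquad (172)$$

where

$$r_k = \sqrt{\left[\sum_i x_i f_k(t_i)\cos\omega t_i\right]^2 + \left[\sum_i x_i f_k(t_i)\sin\omega t_i\right]^2}. \qquad (173)$$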


Now the problem is to find $\sigma_N^2(l)$. The variance of each term in the sum in Eq.
(172) can be found since the distribution function with noise alone can be found in
Section 4.5. Since the $f_k(t)$ are orthogonal, the distributions of the $r_k$ are independent,
and the terms in the sum in Eq. (172) are independent. Then the variance of the likelihood
ratio, $\sigma_N^2(l)$, is the sum of the variances of the terms, divided by $M^2$.

The distribution function for each term $\exp(-E/N_0)\, I_0(r_k/N)$ is given in Section
4.5 by Eqs. (84) and (85). If $\alpha$ is defined by the equation

$$\beta = \exp\left[-\frac{E}{N_0}\right] I_0\!\left(\sqrt{\frac{2E}{N_0}}\,\alpha\right), \qquad (174)$$

then the distribution function in the presence of noise for each term in Eq. (172) is

$$F_N^{(k)}(\beta) = \exp\left[-\frac{\alpha^2}{2}\right]. \qquad (175)$$
The mean value of each term is

$$\mu_N^{(k)}(\beta) = \int_0^{\infty} \exp\left[-\frac{E}{N_0}\right] I_0\!\left(\sqrt{\frac{2E}{N_0}}\,\alpha\right) \alpha \exp\left[-\frac{\alpha^2}{2}\right] d\alpha. \qquad (176)$$

This can be evaluated as on page 174 of Threshold Signals [5], and the result is that
$\mu_N^{(k)}(\beta) = 1$.

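The second moment of each term is

$$\mu_N^{(k)}(\beta^2) = \int \beta^2 \, dF_N^{(k)}(\beta),$$

or

$$\mu_N^{(k)}(\beta^2) = \int_0^{\infty} \exp\left[-\frac{2E}{N_0}\right] \left[I_0\!\left(\sqrt{\frac{2E}{N_0}}\,\alpha\right)\right]^2 \alpha \exp\left[-\frac{\alpha^2}{2}\right] d\alpha. \qquad (177)$$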
The integral can be evaluated as in Appendix E of Part II of reference [17], and the
result is

$$\mu_N^{(k)}(\beta^2) = I_0\!\left(\frac{2E}{N_0}\right). \qquad (178)$$

The variance of each term in Eq. (172) is

$$[\sigma_N^{(k)}(\beta)]^2 = \mu_N^{(k)}(\beta^2) - [\mu_N^{(k)}(\beta)]^2 = I_0\!\left(\frac{2E}{N_0}\right) - 1. \qquad (179)$$

It follows that the variance of $Ml$ is

$$\sigma_N^2(Ml) = M\left[I_0\!\left(\frac{2E}{N_0}\right) - 1\right], \qquad (180)$$

and therefore

$$\sigma_N^2(l) = \frac{1}{M}\left[I_0\!\left(\frac{2E}{N_0}\right) - 1\right], \qquad (181)$$


since  the  variance  for  the  sum  of  independent  random  variables  is  the  sum  of  the 
variances. 

If the approximation described in Section 4.9 is used, the receiver operating
characteristic curves are approximately those of Fig. 2, with

$$d = \ln(1 + \sigma_N^2) = \ln\left[1 - \frac{1}{M} + \frac{1}{M} I_0\!\left(\frac{2E}{N_0}\right)\right]. \qquad (182)$$


4.12  The  broad  band  receiver  and  the  optimum  receiver 

A  few  applications  of  the  results  of  Section  4  are  suggested  in  Table  I,  Section 
4.1.  Two  further  examples  of  practical  knowledge  obtainable  from  the  theory  are 
presented  in  this  section  and  in  the  next. 

One common method of detecting pulse signals in a frequency band of width
$B$ is to build a receiver which covers this entire frequency band. Such a receiver with a
pulse signal of known starting time is studied in Section 4.7. This is not a truly optimum
receiver; it would be interesting to compare it with an optimum receiver. We
have been unable to find the distribution of likelihood ratio for the case of a signal
which is a pulse of unknown carrier phase if the frequency is distributed evenly over a
band. However, if the problem is changed slightly, so that the frequency is restricted
to points spaced approximately the reciprocal of the pulse width apart, then pulses
at different frequencies are approximately orthogonal, and the case of the signal which
is one of $M$ orthogonal signals known except for phase can be applied. Eq. (182)
should be used with $M$ equal to the ratio of the frequency band width $B$ to the pulse
band width. Since the band width of a pulse is approximately the reciprocal of its pulse
width, the parameter $M$ used in Section 4.7 also has this value. Curves showing
$2E/N_0$ as a function of $d$ are given in Fig. 7 for both the approximate optimum receiver
and the broad band receiver for several values of $M$. In the figure, $d$ is calculated from
Eq. (135) and Eq. (182), which hold for large values of $M$.

Fig. 7. Comparison of optimum and broad band receivers.

4.13  Uncertainty  and  signal  detectability 

In the two cases where the signal considered is one of $M$ orthogonal signals, the
uncertainty of the signal is a function of $M$. This provides an opportunity to study the
effect of uncertainty on signal detectability. In the approximate evaluation of the optimum
receiver when the signal is one of $M$ orthogonal functions, the ROC curves of
Figs. 2 and 3 are used with the detection index $d$ given by

$$d = \ln\left[1 - \frac{1}{M} + \frac{1}{M}\exp\left(\frac{2E}{N_0}\right)\right]. \qquad (167)$$

This equation can be solved for the signal energy, yielding

$$\frac{2E}{N_0} = \ln[1 - M + Me^{d}] \approx \ln M + \ln(e^{d} - 1), \qquad (175)$$

the approximation holding for large $2E/N_0$.* From this equation it can be seen that
the signal energy is approximately a linear function of $\ln M$ when the detection index $d$,
and hence the ability to detect signals, is kept constant. It might be suspected that
$2E/N_0$ is a linear function of the entropy, $-\sum p_i \ln p_i$, where $p_i$ is the probability of the
$i$th signal. The linear relation holds only when all the $p_i$ are equal. The expression
which occurs in this more general case is:

$$\frac{2E}{N_0} \approx -\ln\left(\sum_i p_i^2\right) + \ln(e^{d} - 1). \qquad (176)$$
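
The difference between the entropy and the quantity $-\ln\sum p_i^2$ that enters the more general expression can be seen in a short numerical sketch. The function names and the two example probability assignments below are assumptions chosen for illustration; only the formula itself is taken from the text.

```python
import math

def entropy(p):
    """Entropy -sum p_i ln p_i of a discrete distribution (natural logarithm)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def approx_energy_ratio(p, d):
    """Approximate 2E/N0 for large 2E/N0: -ln(sum p_i^2) + ln(e^d - 1)."""
    return -math.log(sum(pi * pi for pi in p)) + math.log(math.expm1(d))

d = 3.0                                   # assumed detection index
uniform = [1.0 / 16] * 16                 # equal probabilities, M = 16
skewed = [0.7, 0.1, 0.1, 0.1]             # hypothetical unequal probabilities

for p in (uniform, skewed):
    print(round(entropy(p), 4), round(-math.log(sum(q * q for q in p)), 4),
          round(approx_energy_ratio(p, d), 4))
```

For the equal-probability case both quantities reduce to $\ln M$, which is why the energy is then linear in the entropy; for the unequal case they differ.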
FOUNDATIONAL  ASPECTS  OF  THEORIES  OF  MEASUREMENT ¹

DANA  SCOTT  and  PATRICK  SUPPES 

1.  Definition  of  measurement.  It  is  a  scientific  platitude  that  there 
can be neither precise control nor prediction of phenomena without measurement.
Disciplines as diverse as cosmology and social psychology provide
evidence  that  it  is  nearly  useless  to  have  an  exactly  formulated  quantitative 
theory  if  empirically  feasible  methods  of  measurement  cannot  be  developed 
for  a  substantial  portion  of  the  quantitative  concepts  of  the  theory.  Given 
a  physical  concept  like  that  of  mass  or  a  psychological  concept  like  that  of 
habit strength, the point of a theory of measurement is to lay bare the structure
of a collection of empirical relations which may be used to measure the
characteristic  of  empirical  phenomena  corresponding  to  the  concept.  Why 
a  collection  of  relations?  From  an  abstract  standpoint  a  set  of  empirical 
data  consists  of  a  collection  of  relations  between  specified  objects.  For 
example,  data  on  the  relative  weights  of  a  set  of  physical  objects  are  easily 
represented  by  an  ordering  relation  on  the  set;  additional  data,  and  a 
fortiori  an  additional  relation,  are  needed  to  yield  a  satisfactory  quantitative 
measurement  of  the  masses  of  the  objects. 

The major source of difficulty in providing an adequate theory of measurement
is to construct relations which have an exact and reasonable numerical
interpretation and yet also have a technically practical empirical interpretation.
The classical analyses of the measurement of mass, for instance,
have  the  embarrassing  consequence  that  the  basic  set  of  objects  measured 
must  be  infinite.  Here  the  relations  postulated  have  acceptable  numerical 
interpretations,  but  are  utterly  unsuitable  empirically.  Conversely,  as  we 
shall  see  in  the  last  section  of  this  paper,  the  structure  of  relations  which 
have  a  sound  empirical  meaning  often  cannot  be  succinctly  characterized 
so  as  to  guarantee  a  desired  numerical  interpretation. 

Nevertheless  this  major  source  of  difficulty  will  not  here  be  carefully 
scrutinized  in  a  variety  of  empirical  contexts.  The  main  point  of  the  present 
paper  is  to  show  how  foundational  analyses  of  measurement  may  be  grounded 
in  the  general  theory  of  models,  and  to  indicate  the  kind  of  problems  relevant 
to  measurement  which  may  then  be  stated  (and  perhaps  answered)  in  a 
precise  manner. 
Before turning to problems connected with construction of theories of
measurement, we want to give a precise set-theoretical meaning to the
notions involved. To begin with, we treat sets of empirical data as being
(finitary) relational systems, that is to say, finite sequences of the form
$\mathfrak{A} = \langle A, R_1, \ldots, R_n \rangle$, where $A$ is a non-empty set of elements called the
domain of the relational system $\mathfrak{A}$, and $R_1, \ldots, R_n$ are finitary relations
on $A$. The relational system $\mathfrak{A}$ is called finite if the set $A$ is finite; otherwise,
infinite. It should be obvious from this definition that we are mainly
considering qualitative empirical data. Intuitively we may think of each
particular relation $R_i$ (an $m_i$-ary relation, say) as representing a complete
set of "yes" or "no" answers to a question asked of every $m_i$-termed sequence
of objects in $A$. The point of this paper is not to consider that aspect
of measurement connected with the actual collection of data, but rather
the analysis of relational systems and their numerical interpretations.

If $s = \langle m_1, \ldots, m_n \rangle$ is an $n$-termed sequence of positive integers, then
a relational system $\mathfrak{A} = \langle A, R_1, \ldots, R_n \rangle$ is of type $s$ if for each $i = 1, \ldots, n$
the relation $R_i$ is an $m_i$-ary relation. Two relational systems are similar
if there is a sequence $s$ of positive integers such that they are both of type $s$.
Notice that the type of a relational system is uniquely determined only if
all the relations are non-empty; the avoiding of this ambiguity is not
worthwhile. Suppose that two relational systems $\mathfrak{A} = \langle A, R_1, \ldots, R_n \rangle$ and
$\mathfrak{B} = \langle B, S_1, \ldots, S_n \rangle$ are of type $s = \langle m_1, \ldots, m_n \rangle$. Then $\mathfrak{B}$ is a homomorphic
image of $\mathfrak{A}$ if there is a function $f$ from $A$ onto $B$ such that, for each
$i = 1, \ldots, n$ and for each sequence $\langle a_1, \ldots, a_{m_i} \rangle$ of elements of $A$,
$R_i(a_1, \ldots, a_{m_i})$ if and only if $S_i(f(a_1), \ldots, f(a_{m_i}))$. If the function $f$ is one-one,
then $\mathfrak{B}$ is an isomorphic image of $\mathfrak{A}$, or simply $\mathfrak{A}$ and $\mathfrak{B}$ are isomorphic.
$\mathfrak{A}$ is a subsystem of $\mathfrak{B}$ if $A \subseteq B$ and, for each $i = 1, \ldots, n$, the relation $R_i$
is the restriction of the relation $S_i$ to $A$. $\mathfrak{A}$ is imbeddable in $\mathfrak{B}$ if some subsystem
of $\mathfrak{B}$ is a homomorphic image of $\mathfrak{A}$.² A numerical relational system
is simply a relational system whose domain of elements is the set Re of all
real numbers. A numerical assignment for a relational system $\mathfrak{A}$ with respect
to a numerical relational system $\mathfrak{N}$ is a function which imbeds $\mathfrak{A}$ in $\mathfrak{N}$.
A numerical assignment is not required to be one-one.
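
For a finite relational system the definition of a numerical assignment can be checked mechanically. The sketch below is only an illustration: the function name `is_numerical_assignment`, the three-object domain, and the "at least as heavy as" relation are hypothetical examples, not taken from the paper. A map $f$ imbeds the system when each relation holds of a sequence exactly when the corresponding numerical relation holds of the images; the subsystem of the numerical system with domain $f(A)$ is then a homomorphic image.

```python
from itertools import product

def is_numerical_assignment(f, domain, relations, numeric_relations, arities):
    """True if R_i(a_1, ..., a_m) holds exactly when S_i(f(a_1), ..., f(a_m)) holds,
    for every relation R_i and every m-termed sequence from the finite domain."""
    for R, S, m in zip(relations, numeric_relations, arities):
        for seq in product(domain, repeat=m):
            if (seq in R) != bool(S(*(f[a] for a in seq))):
                return False
    return True

# Hypothetical data: objects a and b balance each other, both outweigh c.
domain = ["a", "b", "c"]
at_least_as_heavy = {("a", "a"), ("b", "b"), ("c", "c"),
                     ("a", "b"), ("b", "a"), ("a", "c"), ("b", "c")}
geq = lambda u, v: u >= v

f = {"a": 2.0, "b": 2.0, "c": 1.0}        # not one-one: a and b share a value
print(is_numerical_assignment(f, domain, [at_least_as_heavy], [geq], [2]))
```

The assignment maps two distinct objects to the same number, which is the situation the definition allows by using homomorphism rather than isomorphism.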

Within  the  framework  of  the  preceding  formal  definitions  it  is  now 
possible  to  give  an  exact  characterization  of  a  theory  of  measurement. 
First  of  all  the  general  outlines  of  a  theory  are  determined  by  fixing  a 
finite  sequence  s  of  positive  integers  and  only  considering  relational  systems 
of type $s$. Next a numerical relational system $\mathfrak{N}$ of type $s$ is selected which


² Although in most mathematical contexts imbeddability is defined in terms of
isomorphism  rather  than  homomorphism,  for  theories  of  measurement  this  is  too 
restrictive.  However,  the  notion  of  homomorphism  used  here  is  actually  closely 
connected  with  isomorphic  imbeddability  and  the  facts  are  explained  in  detail  in 
Section  2. 
corresponds to the intended numerical interpretation of the theory, and
only relational systems imbeddable in $\mathfrak{N}$ are permitted. Moreover the
theory need not concern all relational systems of type $s$ imbeddable in $\mathfrak{N}$
but only a distinguished subclass. Since it is reasonable that no special set of
objects be preferred, we require that the distinguished subclass be closed
under isomorphism. We thus arrive at the following characterization of
theories of measurement as definite entities: a theory of measurement is
a class $K$ of relational systems closed under isomorphism for which there
exists a finite sequence $s$ of positive integers and a numerical relational
system $\mathfrak{N}$ of type $s$ such that all relational systems in $K$ are of type $s$ and
imbeddable in $\mathfrak{N}$.³

Some  readers  may  object  that  the  definition  of  theories  of  measurement 
should  be  linguistic  rather  than  set-theoretical  in  character,  since  a  theory 
is  ordinarily  thought  of  as  a  linguistic  entity.  To  be  sure,  many  theories 
of  measurement  have  a  natural  formalization  in  first-order  predicate  logic 
with  identity.  Notice,  however,  that  first-order  axioms  by  themselves  are 
not  adequate,  for  if  they  admit  one  infinite  relational  system  as  a  model 
then  they  have  models  of  every  infinite  cardinality,  and  it  is  difficult  to  see 
how  any  natural  connection  can  be  established  between  numerical  models 
and  models  of  arbitrary  cardinality.  Even  neglecting  this  criticism  first- 
order  axioms  are  not  adequate  to  express  properties  involving  arbitrary 
natural  numbers,  for  example,  that  a  relational  system  is  finite  or  that  as 
an  ordering  it  has  Archimedean  properties.  Any  linguistic  definition  of 
theories  which  will  permit  expression  of  these  more  general  properties  would 
require  extensive  machinery  and  be  immediately  involved  in  some  of  the 
deepest problems of modern metamathematics. On the other hand, we do
not  wish  to  give  the  impression  that  we  reject  any  linguistic  questions. 
In  fact,  we  use  our  set-theoretical  definition  as  a  point  of  departure  for  asking 
just  such  questions. 

On  the  basis  of  the  definition  of  theories  of  measurement  adopted,  two 
questions  naturally  arise,  to  each  of  which  we  devote  a  section.  In  the 
first  place,  is  a  given  class  of  relational  systems  a  theory  of  measurement? 
And  in  the  second  place,  given  a  theory  of  measurement,  in  what  sense  can 
it  be  axiomatized? 

2.  Existence  of  measurement.  A  simple  counterexample  shows  that 
not  every  class  of  relational  systems  of  a  given  type  closed  under  isomorphism 
is  a  theory  of  measurement.  Let  O  be  the  class  of  all  relational  systems  of 
type $\langle 2 \rangle$ that are simple orderings. Let $\langle A, R \rangle$ be a system in O where $R$

3  In  some  contexts  we  shall  say  that  the  class  K  is  a  theory  of  measurement  of  type 
s  relative  to  '^.  Notice  that  a  consequence  of  this  definition  is  that,  if  K  is  a  theory 
of  measurement,  then  so  is  every  subclass  of  K  closed  under  isomorphism.  Moreover, 
the  class  of  all  systems  imbeddable  in  members  of  K  is  also  a  theory  of  measurement 
well-orders $A$ and $A$ has a power not equal to or less than that of the continuum.
Such a relational system can be proved to exist even without the
help of the axiom of choice, but of course with the aid of this axiom the existence
is obvious. By way of contradiction suppose that O is a theory of measurement
relative to a numerical relational system $\langle \mathrm{Re}, S \rangle$. From the definition
it follows that $\langle A, R \rangle$ is imbeddable in $\langle \mathrm{Re}, S \rangle$ and that there is a numerical
assignment $f$ mapping $A$ onto a subset of Re such that $xRy$ if and only if
$f(x)\,S\,f(y)$ for all elements $x, y \in A$. Let $a$, $b$ be elements of $A$ such that
$f(a) = f(b)$. From the hypothesis that $R$ is a simple ordering, we can assume
without loss of generality that $aRb$. Hence, we have $f(a)\,S\,f(b)$, and then
$f(b)\,S\,f(a)$, and finally $bRa$. $R$ is antisymmetric, and so $a = b$. This argument
shows that the function $f$ is one-one. Hence $A$ has the same power as a
subset of Re, which is impossible. This proof shows that every theory
of measurement included in the class O contains only relational systems of
power at most that of the continuum. It is an unsolved problem of set
theory closely connected with the continuum hypothesis whether the class O
restricted to systems of power at most that of the continuum is actually
a theory of measurement.⁴ At least it can be very easily shown that O so
restricted is not a theory of measurement relative to the system $\langle \mathrm{Re}, \leq \rangle$,
where the relation $\leq$ is the usual ordering of the real numbers.⁵ Indeed,
the exact condition that a relational system in O must satisfy to be imbeddable
in $\langle \mathrm{Re}, \leq \rangle$ is not really elementary, and the proof of the necessity
involves the axiom of choice.⁶

Let O' be O restricted to countable relational systems.⁷ It was proved
by Cantor that O' is a theory of measurement relative to $\langle \mathrm{Re}, \leq \rangle$, to formulate
somewhat irreverently his classical result in the terminology of this
paper. This restriction to countable relational systems is always sufficient.
For it can be shown that the class of all countable relational systems of a
given type is a theory of measurement; however, the numerical relational
system required is so bizarre as to be of no practical value.

A  primary  aim  of  measurement  is  to  provide  a  means  of  convenient 
computation.  Practical  control  or  prediction  of  empirical  phenomena 
requires  that  unified,  widely  applicable  methods  of  analyzing  the  important 
relationships between the phenomena be developed. Imbedding the discovered

⁴ In this connection see Sierpinski [5], Section 7, pp. 141 ff., in particular Proposition
C75, where of course different terminology is used.

⁵ It is sufficient here to consider a relational system isomorphic to the ordering of
the ordinals of the second number class or to the lexicographical ordering of all pairs
of real numbers.

⁶ A simple ordering is imbeddable in $\langle \mathrm{Re}, \leq \rangle$ if and only if it contains a countable
dense subset. For the exact formulation and a sketch of a proof, see Birkhoff [1],
pp. 31-32, Theorem 2.

⁷ The word 'countable' means at most denumerable and it refers to the cardinality
of the domains of the relational systems.

relations in various numerical relational systems is the most important
such unifying method that has yet been found. But among the
morass  of  all  possible  numerical  relational  systems  only  a  very  few  are  of 
any  computational  value,  indeed  only  those  definable  in  terms  of  the 
ordinary  arithmetical  notions.  From  an  empirical  standpoint  most  sets 
of  qualitative  data  can  find  numerical  interpretation  by  relations  defined 
in  terms  of  addition  and  ordering  alone.  By  way  of  example  we  may  cite 
the  measurement  of  masses,  distances,  sensation  intensities,  and  subjective 
probabilities.  Frequently  the  consideration  of  weighted  averages  requires 
also  the  use  of  the  multiplication  of  numbers.  However,  in  the  examples 
given  in  this  paper  we  shall  restrict  ourselves  to  the  notions  of  addition  and 
ordering. 

No natural scientific situation would seem strictly to require the consideration
of sets of infinite data. This state of affairs suggests that theories
of  measurement  containing  only  finite  relational  systems  would  suffice  for 
empirical  purposes.  The  problem  is  delicate,  however,  for  the  measurement 
of  a  meteorological  quantity  such  as  temperature  by  an  automatic  recording 
device  is  usually  treated  as  continuous  both  in  its  own  scale  and  in  time. 
Yet  the  important  problem  of  measurement  does  not  really  lie  in  the  correct 
use  of  such  recording  devices  but  rather  in  their  initial  calibration,  a  process 
proceeding  from  a  finite  number  of  qualitative  decisions.  Because  of  the 
awkwardness  of  the  uniform  application  of  finite  relational  systems,  we 
shall  not  generally  make  this  restriction. 

Further  remarks  about  establishing  the  existence  of  measurement  are 
best  motivated  by  reference  to  a  concrete  example.  In  a  recent  paper  [4], 
Luce  has  introduced  a  generalization  of  simple  orderings  which  he  calls 
semiorders. A semiorder is a relational system $\langle A, P \rangle$ of type $\langle 2 \rangle$ which
satisfies the following axioms for all $x, y, z, w \in A$:

S1. Not xPx.

S2. If xPy and zPw, then either xPw or zPy.

S3. If xPy and zPx, then either wPy or zPw.⁸

Such  relations  are  most  likely  to  occur  in  situations  where  objects  are  to 
be  arranged  in  order  and  where  it  is  difficult  to  say  exactly  when  two 
objects  are  indifferent.  For  example,  to  say  that  xPy  might  be  interpreted 
as  meaning  that  the  pitch  of  the  sound  x  is  definitely  higher  than  the  pitch 
of  y,  or  that  the  hue  of  color  x  is  definitely  brighter  than  the  hue  of  color  y, 
or  that  the  weight  of  the  object  x  is  noticeably  greater  than  that  of  y,  etc. 
Indifference  between  two  objects  x  and  y  (in  symbols:  xly)  is  defined  as 
not  xPy,  and  not  yPx.  The  point  of  Luce's  axioms  is  that  the  relation  /  of 


*  See  [4],  Section  2,  p.    181.  The  axioms  given  here  are  actually  a  simplification 
of  those  given  by  Luce. 
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indifference  is  not  always  transitive,  a  fact  easily  appreciated  for  each  of 
the  intuitive  interpretations  given  above. 
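As a hedged illustration of axioms S1-S3 and of the failure of transitivity for
indifference, the following Python sketch checks the axioms by brute force on a
small finite system. The weights, the one-unit threshold, and the helper names
`is_semiorder` and `indifferent` are assumptions made only for this
illustration; they do not come from the text.

```python
from itertools import product

def is_semiorder(A, P):
    """Brute-force check of axioms S1-S3 for a finite system <A, P>.
    P is a set of ordered pairs (x, y) meaning xPy."""
    s1 = all((x, x) not in P for x in A)
    s2 = all((x, w) in P or (z, y) in P
             for x, y, z, w in product(A, repeat=4)
             if (x, y) in P and (z, w) in P)
    s3 = all((w, y) in P or (z, w) in P
             for x, y, z, w in product(A, repeat=4)
             if (x, y) in P and (z, x) in P)
    return s1 and s2 and s3

def indifferent(P, x, y):
    """xIy: neither xPy nor yPx."""
    return (x, y) not in P and (y, x) not in P

# Hypothetical weights; xPy when x is heavier than y by more than one unit.
weight = {"a": 0.0, "b": 0.6, "c": 1.2, "d": 2.5}
A = list(weight)
P = {(x, y) for x in A for y in A if weight[x] > weight[y] + 1}

semiorder_ok = is_semiorder(A, P)
# a I b and b I c hold, yet a I c fails: indifference is not transitive here.
chain = (indifferent(P, "a", "b"), indifferent(P, "b", "c"),
         indifferent(P, "a", "c"))
```

The check runs over all quadruples, so it is only practical for small domains.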

In  his  paper  Luce  gives  a  certain  numerical  interpretation  for  certain 
kinds  of  semiorders,  but  he  does  not  show  that  any  particular  class  of 
semiorders  is  a  theory  of  measurement  in  the  sense  used  here,  because  his 
interpretations  are  not  relative  to  a  fixed  numerical  relation.  However,  in 
the finite case the situation becomes relatively simple. Let ≫ be that
relation between real numbers defined by the condition: x ≫ y if and
only if x > y + 1. Clearly, if x and y are real numbers such that x ≫ y,
then it is fair to say that x is definitely greater than y, or better, x is noticeably
greater than y. It is in fact a simple exercise to prove that the relational
system <Re, ≫> is a semiorder. Further we shall give the proof of the
following result:

The class of finite semiorders is a theory of measurement relative to the
numerical relational system <Re, ≫>.

Before  presenting  the  proof  of  the  above,  it  would  be  well  to  outline  a 
general  method  in  proofs  of  the  existence  of  measurement  which  we  shall 
call the method of cosets. Let 𝔄 = <A, R_1, ..., R_n> be a relational system
of type <m_1, ..., m_n>. A uniquely determined equivalence relation E is
introduced into 𝔄 by the condition: xEy if and only if for each i = 1, ..., n
and each pair <z_1, ..., z_{m_i}>, <w_1, ..., w_{m_i}> of m_i-termed sequences of
elements of A, if z_j ≠ w_j implies {z_j, w_j} = {x, y} for j = 1, ..., m_i, then
R_i(z_1, ..., z_{m_i}) if and only if R_i(w_1, ..., w_{m_i}).

Even  though  the  above  definition  is  complicated  to  state  in  general,  the 
meaning  of  the  relation  xEy  is  simple :  elements  x  and  y  stand  in  the  relation 
E  just  when  they  are  perfect  substitutes  for  each  other  with  respect  to  all 
the relations R_i.⁹

The notion of a weak ordering can serve as an example. Let 𝔄 = <A, R>
where the binary relation R is connected and transitive. Then xEy is
equivalent to the condition: For all z ∈ A, xRz if and only if yRz, and zRx
if and only if zRy. However, this simplifies finally to: xRy and yRx.

Returning now to the general case, define, for each x ∈ A, [x] to be the
class of all y such that xEy. [x] is called the coset of x. Let A* be the class
of all [x] for x ∈ A. Directly from the definition of E we can deduce that it is
permissible to define m_i-ary relations R_i* over A* such that, for all
x_1, ..., x_{m_i} ∈ A, R_i*([x_1], ..., [x_{m_i}]) if and only if
R_i(x_1, ..., x_{m_i}). The relational system 𝔄* = <A*, R_1*, ..., R_n*> is
called the reduction of 𝔄 by cosets.
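The reduction by cosets can be sketched for a system of type <2> as follows.
This is a minimal illustration, not the authors' procedure: it uses the
pairwise form of E (x and y may replace each other in either argument place),
and the function names and the word-length example are hypothetical.

```python
def cosets(A, R):
    """Partition A into cosets of E for a binary relation R (a set of pairs).
    xEy is tested as: for all z, xRz iff yRz, and zRx iff zRy."""
    def E(x, y):
        return all(((x, z) in R) == ((y, z) in R) and
                   ((z, x) in R) == ((z, y) in R) for z in A)
    classes = []
    for x in A:
        for c in classes:
            if E(x, c[0]):        # E is an equivalence, so one representative suffices
                c.append(x)
                break
        else:
            classes.append([x])
    return [frozenset(c) for c in classes]

def reduce_by_cosets(A, R):
    """Return <A*, R*>, where R*([x], [y]) holds iff R(x, y)."""
    classes = cosets(A, R)
    coset_of = {x: c for c in classes for x in c}
    return classes, {(coset_of[x], coset_of[y]) for (x, y) in R}

# Hypothetical weak ordering: words compared by length.
A = ["ab", "cd", "xyz", "q"]
R = {(x, y) for x in A for y in A if len(x) <= len(y)}
A_star, R_star = reduce_by_cosets(A, R)
```

In this example the two words of equal length fall into one coset, and the
reduced relation on the three cosets is a simple ordering.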

It is at once obvious that 𝔄* is a homomorphic image of 𝔄 and that 𝔄**
is isomorphic with 𝔄*. What is not quite obvious is the following:

If 𝔅 is a homomorphic image of 𝔄, then 𝔄* is a homomorphic image of 𝔅.

By way of proof, let f be a homomorphism of 𝔄 onto 𝔅. We wish to
show that if f(x) = f(y), then [x] = [y]. Instead of the general case, assume
for simplicity that 𝔄 and 𝔅 are of type <2> and 𝔄 = <A, R>, 𝔅 = <B, S>.
We must show that if f(x) = f(y), then xEy, or in other words, for all z ∈ A,
xRz if and only if yRz, and zRx if and only if zRy. Assume xRz. It follows
that f(x) S f(z), and hence f(y) S f(z), which implies that yRz. The argument
is clearly symmetric. We have therefore shown that there is a function g
from B onto A* such that g(f(x)) = [x] for x ∈ A. It is trivial to verify that
g is a homomorphism of 𝔅 onto 𝔄*.

⁹ The authors are indebted to the referee for pointing out the work by Hailperin
in [3] which suggested this general definition.

Notice the following relation between the concepts of homomorphic
image and subsystem: if 𝔅 is a homomorphic image of 𝔄, then 𝔅 is isomorphic
to a subsystem of 𝔄. For let f be a homomorphism of 𝔄 onto 𝔅. Let g
be any function from B into A such that f(g(y)) = y for all y ∈ B. The
restriction of 𝔄 to the range of g yields the subsystem of 𝔄 isomorphic
with 𝔅.

Using the above remarks we can establish at once the equivalence:
𝔄 is imbeddable in 𝔅 if and only if 𝔄* is imbeddable in 𝔅.

Further, it follows that any function imbedding 𝔄* in 𝔅 is always an
isomorphism of 𝔄* onto a subsystem of 𝔅, and of all homomorphic images
of 𝔄 this property is characteristic of 𝔄*.

Let K now be any class of relational systems closed under isomorphism.
Let K* be the class of all systems isomorphic to some 𝔄* for 𝔄 ∈ K. In effect
we have shown above:

(i) K is a theory of measurement relative to a numerical relational system
𝔑 if and only if K* is also.

(ii) If K in addition is closed under the formation of subsystems, then K*
is the class of all systems in K possessing only one-one numerical
assignments.

To  use  our  example  again,  if  K  is  the  class  of  weak  orders,  then  K*  is 
the  class  of  simple  orders.  Notice  that  the  proof  in  the  first  paragraph 
of  this  section  is  a  special  case  of  (ii). 

It should be remarked that for a relational system 𝔄, 𝔄 and 𝔄* always
satisfy exactly the same formulas of first-order logic not involving the
notion of identity. Hence, if K is the class of all relational systems satisfying
first-order axioms without identity, then K* is the class of all systems
satisfying the axioms for K and in addition satisfying the axiom:

(*) If xEy, then x = y.

The  application  of  this  remark  to  weak  orderings  and  simple  orderings 
is  left  to  the  reader. 

Consider again the case of semiorders. Let S be the class of all finite
semiorders. For any <A, P> ∈ S, consider the relation I of indifference
defined above. In terms of I one can establish a simplified characterization
of E: xEy if and only if for all z ∈ A, xIz if and only if yIz.
Introduce (*) as a new axiom S4. The class of all 𝔄 ∈ S satisfying S4
is just the class S*. Notice that unlike the pleasant situation with weak
orderings and simple orderings, the class S* is not closed under the formation
of subsystems even though S is.

For any semiorder <A, P> introduce a further relation R as follows:
xRy if and only if for all z, if zPx then zPy, and if yPz then xPz.

We leave to the reader the elementary verification of the fact that R
is a weak ordering of A, and that xEy if and only if xRy and yRx. Thus,
if <A, P> ∈ S*, then R is a simple ordering of A. The connection between
P and R is clearer if one notices that xPy implies xRy, and that, if xRx_1,
x_1Py_1, and y_1Ry, then xPy.
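The definition of R can be tried out directly on a small semiorder. The sketch
below is an illustration under stated assumptions: the four-element relation P
and the names `R_from_P` and `is_weak_ordering` are hypothetical and serve only
to make the definition concrete.

```python
def R_from_P(A, P):
    """xRy iff for all z: zPx implies zPy, and yPz implies xPz."""
    return {(x, y) for x in A for y in A
            if all(((z, x) not in P or (z, y) in P) and
                   ((y, z) not in P or (x, z) in P) for z in A)}

def is_weak_ordering(A, R):
    """A weak ordering is a connected and transitive binary relation."""
    connected = all((x, y) in R or (y, x) in R for x in A for y in A)
    transitive = all((x, z) in R
                     for (x, y) in R for (y2, z) in R if y == y2)
    return connected and transitive

# Hypothetical finite semiorder on four objects.
A = ["a", "b", "c", "d"]
P = {("c", "a"), ("d", "a"), ("d", "b"), ("d", "c")}
R = R_from_P(A, P)
weak = is_weak_ordering(A, R)
```

Here b and c are indifferent under P, yet R still separates them (cRb holds
and bRc does not), because c lies above a while b does not.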

Now let 𝔄 = <A, P> be a fixed member of S*. We wish to show that 𝔄
has an assignment in <Re, ≫>. Under the relation R, A is simply ordered.
Let A = {x_0, ..., x_n} where x_i R x_{i-1} and x_i ≠ x_{i-1}. Define by a course of
values recursion a sequence a_0, ..., a_n of rational numbers determined
uniquely by the following two conditions:

(1) If x_i I x_0, then a_i = i/(i+1).

(2) If x_i I x_j and x_i P x_{j-1} where j > 0, then
    a_i = (i/(i+1)) a_j + (1/(i+1)) a_{j-1} + 1.

Notice that in (2) the hypothesis implies that j ≤ i, while in the case
j = i the formula for a_i simplifies to a_i = a_{i-1} + i + 1. Notice further
that every element x_i comes either under (1) or (2); for letting x_j be the
first element such that x_i I x_j, there are two cases: j = 0, j > 0. Clearly
we always have a_i ≥ 0.

We show first that a_i > a_{i-1} by induction on i. For case (1), this is obvious.
Passing to (2), assume that x_i I x_j and x_i P x_{j-1}. If x_{i-1} I x_0, then a_{i-1} < 1
while a_i > 1. Hence we can assume not x_{i-1} I x_0, or in other words x_{i-1} P x_0.
Let x_k be the first element such that x_{i-1} I x_k and x_{i-1} P x_{k-1}. By definition
a_{i-1} = ((i-1)/i) a_k + (1/i) a_{k-1} + 1. If j = i, there is no problem. Assume then
that j < i. Now x_{i-1} R x_j, x_i R x_{i-1}, and x_i I x_j, hence x_j I x_{i-1}, and so by our
choice of k we have k ≤ j. By the induction hypothesis on i, it follows
that a_j > a_{j-1} and a_k > a_{k-1}. If k = j, the required inequality is obvious.
If k ≤ j-1, then a_i > a_{j-1} + 1. Similarly a_{i-1} < a_k + 1, but again, by
the induction hypothesis, a_k ≤ a_{j-1}, and hence a_i > a_{i-1}.

The next step is to prove that, if x_i P x_k, then a_i > a_k + 1. Let x_j be the
first element such that x_i I x_j and x_i P x_{j-1}. We have j-1 ≥ k, and, in view
of the preceding argument, a_{j-1} ≥ a_k. But a_{j-1} + 1 < a_i, whence a_i > a_k + 1.

Conversely we must show that, if a_i > a_k + 1, then x_i P x_k. The hypothesis
of course implies i > k. Assume by way of contradiction that not x_i P x_k.
It follows that x_i I x_k. Let x_j be the first element such that x_i I x_j; then
k ≥ j and a_k ≥ a_j. If j = 0, then x_i I x_0 and x_k I x_0, because x_i R x_k. But
then 0 ≤ a_i < 1 and 0 ≤ a_k < 1, which contradicts the inequality
a_i > a_k + 1. We can conclude that j > 0. Now a_i < a_j + 1, but a_k ≥ a_j,
and thus a_i < a_k + 1, which again is a contradiction. All cases have been
covered, and the argument is complete.

Finally define a function f on A such that f(x_i) = a_i. We have actually
shown that f imbeds 𝔄 in <Re, ≫>. Thus it has been proved that S* is a
theory of measurement relative to <Re, ≫>, and, by the general remarks
on the method of cosets, we conclude that S is also a theory of measurement
relative to <Re, ≫>.
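The recursion in conditions (1) and (2) can be carried out mechanically. The
following Python sketch is one possible rendering, assuming the input is a
finite semiorder that is already reduced (a member of S*); the function name
`semiorder_assignment` and the example relation are hypothetical. Exact
rational arithmetic is used because the a_i are rational numbers.

```python
from fractions import Fraction

def semiorder_assignment(A, P):
    """Assignment f with xPy iff f(x) > f(y) + 1, for a finite semiorder
    <A, P> assumed to be reduced (no two distinct elements E-equivalent)."""
    def indiff(x, y):
        return (x, y) not in P and (y, x) not in P

    def R(x, y):
        return all(((z, x) not in P or (z, y) in P) and
                   ((y, z) not in P or (x, z) in P) for z in A)

    # Under R a reduced system is simply ordered; counting the y with xRy
    # gives the position of x, so xs[0] plays the role of x_0.
    xs = sorted(A, key=lambda x: sum(R(x, y) for y in A))
    a = []
    for i, xi in enumerate(xs):
        # j indexes the first element indifferent to x_i (j <= i always).
        j = next(k for k, xk in enumerate(xs) if indiff(xi, xk))
        if j == 0:                       # condition (1)
            a.append(Fraction(i, i + 1))
        elif j == i:                     # condition (2) in the case j = i
            a.append(a[i - 1] + i + 1)
        else:                            # condition (2) with 0 < j < i
            a.append(Fraction(i, i + 1) * a[j]
                     + Fraction(1, i + 1) * a[j - 1] + 1)
    return dict(zip(xs, a))

# Hypothetical reduced semiorder on four objects.
A = ["a", "b", "c", "d"]
P = {("c", "a"), ("d", "a"), ("d", "b"), ("d", "c")}
f = semiorder_assignment(A, P)
imbeds = all(((x, y) in P) == (f[x] > f[y] + 1) for x in A for y in A)
```

For a system that is not reduced, the ordering step above is not well defined;
by the method of cosets one would first pass to the reduced system and then
give every element the value of its coset.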

Notice that the above proof would also work in the infinite case as long
as the ordering R is a well-ordering of type ω.

Let  us  now  summarize  the  steps  in  establishing  the  existence  of  measure- 
ment using  as  examples  simple  orderings  and  semiorders.  First,  after  one 
is given a class, K say, of relational systems, the numerical relational
system  should  be  decided  upon.  The  numerical  relational  system  should  be 
suggested  naturally  by  the  structure  of  the  systems  in  K,  and  as  was  re- 
marked, it  is  most  practical  to  consider  numerical  systems  where  all  the 
relations  can  be  simply  defined  in  terms  of  addition  and  ordering  of  real 
numbers.  Second,  if  the  proof  that  K  is  a  theory  of  measurement  is  not  at 
once  obvious,  the  cardinality  of  systems  in  K  should  be  taken  into  considera- 
tion. The  restriction  to  countable  systems  would  always  seem  empirically 
justified,  and  adequate  results  are  possible  with  a  restriction  to  finite 
systems.  Third,  the  proof  of  the  existence  of  measurement  can  often  be 
simplified  by  the  reduction  of  each  relational  system  in  K  by  the  method 
of  cosets.  Then,  instead  of  trying  to  find  numerical  assignments  for  each 
member  of  K,  one  concentrates  only  on  the  reduced  systems.  This  plan  was 
helpful  in  the  case  of  semiorders.  Instead  of  cosets,  it  is  sometimes  feasible 
to consider imbedding by subsystems. That is to say, one considers some
convenient subclass K' ⊆ K such that every element of K is a subsystem
of some system in K'. If K' is a theory of measurement, then so is K. In
the  case  of  semiorders  we  could  have  used  either  plan :  cosets  or  subsystems. 

After  the  existence  of  measurement  has  been  established,  there  is  one 
question which is often of interest: For a given relational system, what
is  the  class  of  all  its  numerical  assignments?  We  present  an  example. 

Consider relational systems 𝔄 = <A, D> of type <4>. For such systems
we introduce the following definitions: xRy if and only if xyDyy. xyM^1zw
if and only if xyDzw, zwDxy, yRz and zRy. xyM^{n+1}zw if and only if there
exist u, v ∈ A such that xyM^n uv and uvM^1 zw.

Let H be the class of all such relational systems which satisfy the following
axioms for every x, y, z, u, v, w ∈ A:

A1. If xyDzw and zwDuv, then xyDuv.

A2. xyDzw or zwDxy.

A3. If xyDzw, then xzDyw.

A4. If xyDzw, then wzDyx.

A5. If xRy and yzDuv, then xzDuv.

A6. There is a z ∈ A such that xzDzy and zyDxz.

A7. If not xyDzw and not xRy, then there is a u ∈ A such that zwDxu,
not xRu, and not uRy.

A8. If xyDzw and not xRy, then there are u, v ∈ A and an n such that
zuM^n vw and zuDxy.

These axioms imply that for a system 𝔄 in H, the relation R is a weak
ordering of A, and the intuitive interpretation of xyDzw in case yRx and
wRz is that the interval between x and y is not greater than the interval
between z and w. Making heavy use of the last three existence axioms, it
can be shown that H is a theory of measurement relative to the numerical
relational system <Re, Δ> where Δ is the quaternary relation defined by
the condition xyΔzw if and only if x − y ≤ z − w for all x, y, z, w ∈ Re. It
must be stressed that the Archimedean property of the ordering embodied
in A8 cannot be formulated in first-order logic, because it implies that all
systems in H* have cardinality not more than the power of the continuum.
In addition, it can be shown that, if 𝔄 is in H, and f and g are two numerical
assignments of 𝔄 relative to <Re, Δ>, then f and g are related by a positive
linear transformation;¹⁰ that is, there exist α, β ∈ Re with α > 0 such
that, for all x ∈ A, f(x) = αg(x) + β. This gives in a certain sense the answer
to the question above: If we know one numerical assignment for 𝔄, we
know them all. Except for very special systems in H, nothing more specific
can really be expected.

Notice that all relational systems in H are necessarily infinite. In the
next section we shall consider in detail the theory of measurement F
consisting of all finite relational systems imbeddable in <Re, Δ>. Here the
situation is quite hopeless. There simply is no apparent general statement
that can be made about the relation between assignments. Inasmuch as
any function φ which imbeds <Re, Δ> in itself is necessarily a linear
transformation and conversely, it follows that, if 𝔄 is a system in F and f is an
assignment for 𝔄, then f composed with a linear transformation is also an
assignment. The main difficulty with F is that two assignments for the
same system in F need not be related by a linear transformation.
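The last point can be seen on a three-element system. In the sketch below,
which is only an illustration, the assignments `f` and `g` and the helper
`delta_system` are hypothetical; the relation is taken to be x − y ≤ z − w as
in the definition of Δ.

```python
from itertools import product

def delta_system(values):
    """Relation D induced on the keys of `values`: (x, y, z, w) is in D
    when values[x] - values[y] <= values[z] - values[w]."""
    A = list(values)
    return {q for q in product(A, repeat=4)
            if values[q[0]] - values[q[1]] <= values[q[2]] - values[q[3]]}

# Two hypothetical assignments on the same three objects.
f = {"a": 0, "b": 1, "c": 3}
g = {"a": 0, "b": 1, "c": 4}

# Both induce the same finite relational system of type <4> ...
same_system = delta_system(f) == delta_system(g)

# ... but the only candidate linear map fixed by a and b fails at c.
alpha = (f["b"] - f["a"]) / (g["b"] - g["a"])
beta = f["a"] - alpha * g["a"]
linearly_related = all(f[x] == alpha * g[x] + beta for x in f)
```

Both assignments order the nine differences in the same way, so they imbed the
same system, although no positive linear transformation carries one into the
other.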

3.  Axiomatizability.  Given  a  theory  of  measurement,  it  is  natural 
to  ask  various  questions  about  its  axiomatizability,  for  the  axiomatic 
analysis  of  any  mathematical  theory  usually  throws  considerable  light 
on the structure of the theory. In particular, given an extrinsic
characterization of a theory of measurement via a particular numerical relational
system, it is quite desirable to have an intrinsic axiomatic characterization
of the theory to be able better to recognize when a relational system actually
belongs to the theory. In view of the paucity of metamathematical results
concerning the axiomatics of higher-order theories, we shall restrict
ourselves to the problem of axiomatizing theories of measurement in
first-order logic.

¹⁰ The proofs of both these facts about H are very similar to the corresponding
proofs in Suppes and Winet [6].

It  is  a  well-known  result  that,  if  a  set  of  first-order  axioms  has  one 
infinite  model,  then  it  has  models  of  unbounded  cardinalities.  Since  for 
the  most  part  we  are  interested  in  one-one  assignments  with  values  in  the 
set  of  real  numbers,  unbounded  cardinalities  are  hardly  an  asset.  That  is  to 
say,  the  class  of  all  relational  systems  that  are  models  of  a  given  set  of 
first-order  axioms  is  usually  not  a  theory  of  measurement.  To  remove  such 
difficulties  without  having  to  understand  them,  we  simply  restrict  the 
cardinalities  under  consideration.  Even  a  restriction  to  finite  cardinalities 
is  not  too  strong  and  leads  to  some  rather  difficult  questions.  Thus  for  the 
remainder  of  this  section  we  shall  consider  only  finitary  theories  of  measure- 
ment, i.e.,  theories  containing  only  finite  relational  systems.  Such  a  theory 
is  called  axiomatizable,  if  there  exists  a  set  of  sentences  of  first-order  logic 
(the  axioms  of  the  theory)  such  that  a  finite  relational  system  is  in  the 
theory  if  and  only  if  the  system  satisfies  all  the  sentences  in  the  set.  A 
theory  is  finitely  axiomatizable  if  it  has  a  finite  set  of  axioms.  A  theory 
is universally axiomatizable if it has a set of axioms each of which is a
universal sentence (i.e., a sentence in prenex normal form with only universal
quantifiers).

It  should  be  observed,  first,  that  any  finitary  theory  of  measurement 
is  axiomatizable.  This  is  no  deeper  than  saying  that  in  first-order  logic  we 
can  write  down  a  sentence  completely  describing  the  isomorphism  type  of 
each  finite  relational  system  not  in  the  given  theory,  and  clearly  the  nega- 
tions of  these  sentences  can  serve  as  the  required  set  of  axioms.  It  is  of 
course  quite  obvious  that  we  cannot  in  each  instance  give  an  effective  method 
for  writing  down  the  axioms,  since  there  are  clearly  a  continuum  number 
of distinct finitary theories of measurement. Notice also that if the theory
is closed under subsystems then the axioms may be taken as universal
sentences, and conversely. In case one considers theories consisting of all finite
relational  systems  imbeddable  in  a  given  numerical  relational  system,  then 
the  problem  of  a  recursive  or  effective  axiomatization  is  simply  the  problem 
of  whether  the  class  of  universal  sentences  true  in  the  given  numerical 
relational  system  is  recursively  enumerable  or  not.  It  is  not  difficult  to 
establish  that  this  last  problem  is  equivalent  to  the  problem  of  giving  a 
recursive  enumeration  of  all  the  relation  types  of  finite  relational  systems 
not imbeddable in the given numerical relational system. For numerical
relational systems whose relations are definable in first-order logic in terms
of + and ≤, these problems do not arise since the first-order theory of
+ and ≤ is decidable, and it is to these relational systems that we shall
primarily restrict our further attention.

In  the  second  place,  in  all  domains  of  mathematics  a  finite  axiomatization 
of  a  theory  is  usually  felt  to  be  the  most  satisfactory  result.  No  doubt  the 
psychological  basis  for  such  a  feeling  rests  on  the  fact  that  only  a  finite 
characterization  can  in  one  step  explicitly  lay  bare  the  full  structure  of  a 
theory.  Of  course  an  extremely  complicated  axiomatization  may  be  of  little 
practical  value,  and  as  regards  theories  of  measurement  there  is  a  further 
complication.  Namely,  if  an  axiomatization  in  first-order  logic,  no  matter 
how  elegant  it  may  be,  involves  a  combination  of  several  universal  and 
existential  quantifiers,  then  the  confirmation  of  this  axiom  may  be  highly 
contingent  on  the  relatively  arbitrary  selection  of  the  particular  domain 
of  objects.  From  the  empirical  standpoint,  aside  from  the  possible  require- 
ment of  a  fixed  minimal  number  of  objects,  results  ought  to  be  independent 
of  an  exact  specification  of  the  extent  of  the  domain. 

We  are  thus  brought  to  our  third  observation:  A  finite  universal  axio- 
matization of  a  theory  of  measurement  always  yields  a  characterization 
independent  of  accidental  object  selection.  To  be  precise,  consider  a  fixed 
universal  sentence.  This  formula  will  obviously  contain  just  a  finite  number 
of  variables.  Hence,  to  verify  the  truth  of  the  sentence  in  a  particular 
relational  system,  we  need  consider  only  subsets  of  the  domain  of  a  uniformly 
bounded  cardinality.  Furthermore,  verification  for  each  subset  is  completely 
independent  of  any  relationships  with  the  complementary  set. 

Simple  orderings  and  semiorders  are  examples  of  this  last  point.  To 
determine  whether  a  finite  relational  system  of  type  <2>  is  a  simple  ordering, 
one  has  only  to  consider  triples  of  objects;  for  semiorders,  quadruples.  In 
constructing  an  experiment,  say,  on  the  simple  ranking  of  objects  with 
respect  to  a  certain  property,  the  design  is  ordinarily  such  that  connectivity 
and  antisymmetry  of  the  relation  are  satisfied,  because  for  each  pair  of 
objects  the  subject  is  required  to  decide  the  ranking  one  way  or  the  other, 
but  not  in  both  directions.  Analysis  of  the  data  then  reduces  to  searching 
for  "intransitive  triads". 
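Because transitivity is a universal sentence in three variables, the search
for intransitive triads only has to inspect triples of objects. A minimal
sketch, with hypothetical paired-comparison data and a hypothetical function
name:

```python
from itertools import permutations

def intransitive_triads(objects, prefers):
    """Return triples (x, y, z) with x over y and y over z but not x over z.
    `prefers` is a set of ordered pairs (x, y): x was ranked above y."""
    return [(x, y, z) for x, y, z in permutations(objects, 3)
            if (x, y) in prefers and (y, z) in prefers
            and (x, z) not in prefers]

# Hypothetical data: each pair judged one way only, with one circular triad.
objects = ["p", "q", "r", "s"]
prefers = {("p", "q"), ("q", "r"), ("r", "p"),
           ("p", "s"), ("q", "s"), ("r", "s")}
triads = intransitive_triads(objects, prefers)
```

The circular triad p, q, r is reported once for each of its three rotations;
the object s, which is ranked below all the others, takes part in none.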

Vaught  [8]  has  provided  a  useful  criterion  for  certain  classes  of  relational 
systems  to  be  axiomatizable  by  means  of  a  universal  sentence.  A  straight- 
forward analysis  of  his  proof  yields  immediately  the  following  criterion  for 
finitary  theories  of  measurement. 

A finitary theory of measurement K is axiomatizable by a universal sentence,
if and only if K is closed under subsystems and there is an integer n such that,
if any finite relational system 𝔄 has the property that every subsystem of 𝔄
with no more than n elements is in K, then 𝔄 is in K.

Though classes of finite simple orderings and finite semiorders are two
examples of finitary theories of measurement axiomatizable by a universal
sentence, there are interesting examples of finitary theories of measurement
closed under subsystems which are not axiomatizable by a universal
sentence. We now turn to the proof for one such case.

Let F be the class of all finitary relational systems of type <4> imbeddable
in the numerical relational system <Re, Δ>. A wide variety of sets of
empirical  data  are  in  F.  In  fact,  all  sets  of  psychological  data  based  upon 
judgments  of  differences  of  sensation  intensities  or  of  differences  in  utility 
qualify  as  candidates  for  membership  in  F.  For  example,  in  an  experiment 
concerned with the subjective measurement of loudness of n sounds, the
appropriate  empirical  data  would  be  obtained  by  asking  subjects  to  compare 
each  of  the  n  sounds  with  every  other  and  then  to  compare  the  difference  of 
loudness  in  every  pair  of  sounds  with  every  other.  More  elaborate  inter- 
pretations are  required  to  obtain  appropriate  data  on  utility  differences  for 
individuals  or  social  groups  (cf.  Davidson,  Suppes  and  Siegel  [2],  Suppes  and 
Winet  [6]).  It  may  be  of  some  interest  to  mention  one  probabilistic  inter- 
pretation closely  related  to  the  classical  scaling  method  of  paired  compari- 
sons. Subjects  are  asked  to  choose  only  between  objects,  but  they  are  asked 
to  make  this  choice  a  number  of  times.  There  are  many  situations  in  which 
they vacillate in their choice, and the probability p_xy that x will be chosen
over y may be estimated from the relative frequency with which x is so
chosen. From inequalities of the form p_xy ≤ p_zw we may obtain a set of
empirical data, that is, a finite relational system of type <4>, which is a
candidate for membership in F. The intended interpretation is that, if
p_xy ≥ 1/2 and p_zw ≥ 1/2, then p_xy ≤ p_zw if and only if the difference in
sensation intensity or difference in utility between x and y is equal to or less
than  that  between  z  and  w,  the  idea  being,  of  course,  that  if  x  and  y  are 
closer  together  than  z  and  w  in  the  subjective  scale,  then  the  relative 
frequency  of  choice  of  x  over  y  is  closer  to  one-half  than  that  of  z  over  w. 
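The passage from estimated choice probabilities to a relational system of type
<4> can be sketched as follows. The probabilities are invented for the
illustration, the function name is hypothetical, and setting p_xx = 1/2 is an
added assumption so that every quadruple is covered.

```python
from itertools import product

def relation_from_choice_probabilities(objects, p):
    """Build D from estimates p[(x, y)] of the probability that x is chosen
    over y: (x, y, z, w) is in D when p[(x, y)] <= p[(z, w)].
    Pairs (x, x) are given probability 1/2 (an assumption of this sketch)."""
    def prob(x, y):
        return 0.5 if x == y else p[(x, y)]
    return {(x, y, z, w) for x, y, z, w in product(objects, repeat=4)
            if prob(x, y) <= prob(z, w)}

# Hypothetical relative frequencies from repeated choices among three objects.
objects = ["a", "b", "c"]
p = {("a", "b"): 0.60, ("b", "a"): 0.40,
     ("b", "c"): 0.70, ("c", "b"): 0.30,
     ("a", "c"): 0.85, ("c", "a"): 0.15}
D = relation_from_choice_probabilities(objects, p)
```

Whether the resulting system actually belongs to F, that is, whether it is
imbeddable in <Re, Δ>, is a separate question that this sketch does not
address.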

Before  formally  proving  that  the  theory  of  measurement  F  is  not  axio- 
matizable by  a  universal  sentence,  we  intuitively  indicate  for  a  relational 
system  of  ten  elements  the  kind  of  difficulty  which  arises  in  any  attempt 
to axiomatize F. Let the ten elements be a_1, ..., a_10 ordered as shown
on the following diagram with atomic intervals given the designations
indicated.

| α_1 | α_2 | α_3 | α_4 | γ | β_1 | β_2 | β_3 | β_4 |

Let α be the interval (a_1, a_5), let β be the interval (a_6, a_10), and let γ be
larger than α or β. We suppose further that α_1, α_2, α_3, α_4 is equal in size to
β_3, β_1, β_4, β_2, respectively, but α is less than β.¹¹
The size relationships among the remaining intervals may be so chosen
that any subsystem of nine elements is imbeddable in <Re, Δ>, whereas
the full system of ten elements is clearly not.

¹¹ Essentially this example was first given in another context by Herman Rubin
to show that a particular set of axioms is defective.

Generalizing  this  example  and  using  the  criterion  derived  from  Vaught's 
theorem  we  now  prove : 

Theorem.  The  theory  of  measurement  F  is  not  axiomatizable  by  a  uni- 
versal sentence. 

Proof. In order to apply the criterion of axiomatizability by a universal
sentence, we need to show that for every n there is a finite relational system
𝔄 of type <4> such that every subsystem of 𝔄 with n elements in its domain
is in F but 𝔄 is not.

To this end, for every even integer n = 2m ≥ 10 we construct a finite
relational system 𝔄 of type <4> such that every subsystem of 2m − 1
elements is in F. (A fortiori every subsystem of 2m − k elements for k < 2m
is in F.) To make the construction both definite and compact, we take
numbers as elements of the domain and disrupt exactly one numerical
relationship. Let now n be an even integer equal to or greater than 10.
The selection of numbers a_1, ..., a_{2m} may be most easily described by
specifying the numerical size of the atomic intervals. We define α_i =
a_{i+1} − a_i for i = 1, ..., m−1 and β_i = a_{m+i+1} − a_{m+i} for i = 1, ..., m−1.
We then set a_1 = 1, α_i = 2^i for i = 1, ..., m−1, and a_{m+1} = 2^[exponent
illegible in the source]. In fixing the size of β_i, we have two cases to consider
depending on the parity of m.

Case 1. m is even. Then m−1 is odd, and we set β_i = α_{i/2} for i = 2,
4, ..., m−2 and β_i = α_{(m+i-1)/2} for i = 1, 3, ..., m−1.

Case 2. m is odd. Then m−1 is even, and we set β_i = α_{i/2} for i = 2,
4, ..., m−1 and β_i = α_{(m+i)/2} for i = 1, 3, ..., m−2. Thus if n = 2m = 12,
we have α_1 = β_2, α_2 = β_4, α_3 = β_1, α_4 = β_3, α_5 = β_5. With the set A =
{a_1, ..., a_{2m}} defined, we now define the relation D as the expected
numerical relation except for permutations of a_1, a_m, a_{m+1} and a_{2m}. If
x, y, z, w ∈ A and <x, y, z, w> is not some permutation of <a_1, a_m, a_{m+1}, a_{2m}>,
then <x, y, z, w> ∈ D if and only if

(1) x − y ≤ z − w.

Moreover, let a = a_1, b = a_m, c = a_{m+1}, d = a_{2m}. Then we put the following
nine permutations of <a, b, c, d> in D:

(2)  <b, a, d, c>  <a, b, d, c>  <c, b, d, a>
     <b, d, a, c>  <a, c, d, b>  <c, d, a, b>
     <b, d, c, a>  <a, d, c, b>  <c, d, b, a>

(These  nine  permutations  correspond  exactly  to  the  strict  inequalities 
following  from  b—a  <  d—c.  All  nine  are  needed  to  make  the  subsystems 
of  iA,  Dy  have  the  appropriate  properties.) 
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From  the  choice  of  the  numbers  in  A  and  the  definition  of  D  it  is  obvious 
that  iA,  Dy  is  not  imbeddable  in  <Re,  A>,  that  is,  that  iA,  Dy  is  not  in  F; 
for  the  atomic  intervals  between  a^  and  a^  must  add  up  to  a  length  equal 
to  the  sum  of  the  atomic  intervals  between  a^+j  and  a^^,  but  by  hypothesis 
the  interval  (a^,  a^  is  less  than  the  interval  {a^^-^,  a^.^.  It  remains  to  show 
that  every  subsystem  of  2m —  1  elements  is  in  F.  Two  cases  naturally  arise. 

Case 1. The element omitted in the subsystem is $a_1$, $a_m$, $a_{m+1}$ or $a_{2m}$. Then the nine permutations of (2) are not in $D$ restricted to the subsystem, and the subsystem is not merely imbeddable in $\langle Re, \Delta \rangle$, but by virtue of (1) is a subsystem of it.

Case 2. The element omitted is neither $a_1$, $a_m$, $a_{m+1}$ nor $a_{2m}$. Let $a_i$ be the element not in the subsystem. There are two cases to consider.

Case 2a. $a_i < a_m$. For this situation we may use for our numerical assignment the function $f$ defined by $f(a_{i-j}) = a_{i-j} + 1$ for $j = 1, \ldots, i-1$, $f(a_{i+j}) = a_{i+j}$ for $j = 1, \ldots, n-i$. It is straightforward but tedious to verify that $f$ is a numerical assignment, that is, that it preserves the relation $D$ as defined by (1) and (2). Only two observations are crucial to this verification. First, regarding atomic intervals (in the full system), if $a_{i-j+1} - a_{i-j} = a_{k+1} - a_k$ for $k > i$, then $f(a_{i-j+1}) - f(a_{i-j}) = (a_{i-j+1} + 1) - (a_{i-j} + 1) = a_{k+1} - a_k = f(a_{k+1}) - f(a_k)$. Second, the numbers in $A$ were so chosen that, if $x, y, z, w \in A$, and $(z, w)$ is not an atomic interval, and $(x, y) \ne (z, w)$ and $x - y \le z - w$, then $x - y + 2 \le z - w$. Then it is clear from the definition of $f$ that $f(x) - f(y) \le f(z) - f(w)$. (Note that the above implies the weaker result that no two distinct nonatomic intervals have the same size.)

Case 2b. $a_i > a_{m+1}$. Here we may use a numerical assignment $f$ defined, as would be expected from the previous case, by $f(a_{i-j}) = a_{i-j}$ for $j = 1, \ldots, i-1$, $f(a_{i+j}) = a_{i+j} + 1$ for $j = 1, \ldots, n-i$. This completes the proof of the theorem.

It  would  be  pleasant  to  report  that  we  could  prove  a  stronger  result  about 
the  theory  of  measurement  F,  namely,  that  it  is  not  finitely  axiomatizable. 
Unfortunately,  there  seems  to  be  a  paucity  of  tools  available  for  studying 
such  questions  for  classes  of  relational  systems.  However,  we  would  like 
to  state  a  conjecture  which  if  true  would  provide  one  useful  tool  for  studying 
the  finite  axiomatizability  of  finitary  theories  of  measurement  like  F  which 
are closed under submodels. We say that two sentences are finitely equivalent
if and only if they are satisfied by the same finite relational systems, and
we conjecture: If S is a sentence such that if it is satisfied by a finite model
it is satisfied by every submodel of the finite model, then there is a universal
sentence finitely equivalent to S. If this conjecture is true, it follows that
any  finitary  theory  of  measurement  closed  under  submodels  is  finitely 
axiomatizable  if  and  only  if  it  is  axiomatizable  by  a  universal  sentence. 

The  proof  (or  disproof)  of  this  conjecture  appears  difficult.  It  easily  follows 
from Tarski's results [7] on universal (arithmetical) classes in the wider
sense  that,  if  the  finitistic  restrictions  are  removed  throughout  in  the 
conjecture,  the  thus  modified  conjecture  is  true;  for  the  class  of  relational 
systems  satisfying  S,  being  closed  under  submodels,  is  a  universal  class  in 
the  wider  sense  and  is  axiomatizable  by  a  denumerable  set  of  universal 
sentences.  Since  S  is  logically  equivalent  to  this  set  of  universal  sentences, 
it  is  a  logical  consequence  of  some  finite  subset  of  them;  but  because  it 
implies  the  full  set,  it  also  implies  the  finite  subset  and  is  thus  equivalent 
to  it. 

Our  conjecture  is  one  concerning  the  general  theory  of  models  and 
its  pertinence  is  not  restricted  to  theories  of  measurement.  In  conclusion 
we  should  like  to  mention  an  unsolved  problem  typical  of  those  which  arise 
in  the  special  area  of  measurement.  Let  R  be  any  binary  numerical  relation 
definable  in  an  elementary  manner  in  terms  of  plus  and  less  than.  Is  the  finitary 
theory  of  measurement  of  all  systems  imbeddable  in  R  finitely  axiomatizable? 
(If our conjecture about finite models is true, then the theory of measurement
F is not finitely axiomatizable and shows that the answer to this
problem  is  negative  for  quaternary  relations  definable  in  terms  of  plus  and 
less  than.) 
MODELS FOR CHOICE-REACTION TIME

Mervyn Stone
MEDICAL RESEARCH COUNCIL*

In  the  two-choice  situation,  the  Wald  sequential  probability  ratio 
decision  procedure  is  applied  to  relate  the  mean  and  variance  of  the  decision 
times,  for  each  alternative  separately,  to  the  error  rates  and  the  ratio  of 
the  frequencies  of  presentation  of  the  alternatives.  For  situations  involving 
more  than  two  choices,  a  fixed  sample  decision  procedure  (selection  of  the 
alternative  with  highest  likelihood)  is  examined,  and  the  relation  is  found 
between  the  decision  time  (or  size  of  sample),  the  error  rate,  and  the  number 
of  alternatives. 

This  paper  develops  to  the  point  of  usefulness  several  mathematical 
models for choice-reaction time. The working details are confined to appendices
and only definitions and results appear in the text. It is hoped that
this  method  of  presentation  will  assist  the  reader  in  making  a  quick 
"calculated-observed"  analysis  of  the  data  he  may  have.  The  choice  of 
models  is  made  mainly  by  analogy  with  statistical  decision  procedures,  but 
no model is presented which is psychologically unreasonable. Also no comparisons
are made with experimental data for several reasons: (i) the paucity
of  available  data  means  that  the  field  should  be  kept  open  to  avoid  premature 
rejections;  (ii)  published  data  are  often  summarized  in  directions  orthogonal 
to  our  interests;  (iii)  for  the  most  powerful  discrimination,  experiments  will 
need  to  be  designed  with  specific  models  in  mind. 

The  models  are  envisaged  as  applying  to  the  situation  in  which  the 
subject  (S)  is  given  a  time-stationary  stimulus  or  signal  and  is  required  to 
identify  some  attribute  of  the  signal  and  make  an  appropriate  reaction.  The 
signal remains present until the reaction is made. S is presented with signal
after  signal  and  the  successive  attributes  form  a  random  sequence;  that  is, 
for  a  given  run  of  signals,  the  attributes  of  different  signals  are  mutually 
independent  and  their  probabilities  of  presentation  do  not  change  with  time. 
The models assume that S has a settled mode of response. They will be hydrodynamic
in the following sense. At the onset of each signal, a stream of
information  about  the  signal  flows  at  a  uniform  rate  into  S.  After  a  certain 
time,  the  input  time,  the  front  of  this  stream  reaches  S's  decision  taking 
mechanism  or  "computer."  After  a  further  time,  the  decision  time,  S  makes 
a  response.  The  time  taken  for  the  response  to  be  recorded  will  be  called  the 
motor  time.  Thus  the  choice-reaction  time  is  made  up  of  three  components: 

*Applied  Psychology  Research  Unit,  15  Chaucer  Road,  Cambridge,  England. 
This  article  appeared  in  Psychometrika,  1960,  25,  251-260.    Reprinted  with  permission. 
the input time, $T_i$; the decision time, $T_d$; the motor time, $T_m$. The models apply to $T_d$, which will be related to the environmental variables (the number of signals and their frequencies of presentation) and the rate at which S makes incorrect responses. By concentrating on $T_d$ in this way, it is not implied that $T_i$ and $T_m$ are necessarily independent of these factors.

Likelihood Ratio Models for the Two-Choice Situation

It is assumed that the subject knows when the signal (either $s_0$ or $s_1$, say) commences; that is, he knows when to start examining the stream of information arriving at the computer. (This stream is "noisy" until the stream from the signal is added to it.) This assumption holds in the self-paced condition and also when some preparatory warning signal is given. It is supposed that there is some overlap in the information; that is, some patterns of information may arise from either $s_0$ or $s_1$. If there is no uncertainty in this sense, there is no need for a statistical computer. The uncertainty may arise from the external situation, from noise added at the input stage, or from both sources. We will suppose that the information on which S's computer operates is equivalent to a series of independent random variables at short time intervals $t$ and that each random variable has the (stationary) distribution of a random variable $x$ (dependent on which signal has occurred) until the response is made.

[Diagram: onset of the signal, followed by the successive samples $x_1, x_2, x_3, \ldots$]


Let $p_0(x)$ and $p_1(x)$ be the probabilities of $x$ when the signal is $s_0$ and $s_1$, respectively. If the $x$'s are instantaneous samples of an almost continuous stream of information then the assumption of independence implies zero auto-correlation between parts of the stream not less than time $t$ apart. If the $x$'s are integrals of the stream over the successive intervals, then the assumption requires zero auto-correlation for all time lags (or at least for those not small compared with $t$). Suppose the computer transforms each $x$ to a quantity $c(x)$ which is then stored in an adder.

Sequential  Case 

The computer makes a running total of $c(x_1), c(x_2), \ldots$. Constants $\log A$ and $\log B$ with $A > B$ are preselected so that S decides for $s_0$ (and makes the appropriate motor action) as soon as the total falls below $\log B$, provided the total has not previously exceeded $\log A$ when the decision would have been made for $s_1$. (The odd way of expressing the constants facilitates later references.) If the decision is made at the $n$th sample, $T_d = nt$. The theory of the sequential probability ratio test [1] shows that the optimum choice of the function $c(x)$ is

(1)  $c(x) = \log p_1(x) - \log p_0(x)$.

Such a function implies that S is familiar with the probability distributions $p_0(x)$ and $p_1(x)$. Such familiarity may be the result of a process of learning, provided S has performed many trials of the discrimination task and is given knowledge of results. S's computer may be thought of as exploratory, trying out different $c(x)$'s until the optimal one is found. However it is conceivable that the distributions can be deduced by S from the structure of the situation and then imposed on his computer. The optimality of (1) is stated by Wald [1] in the following terms: let $\bar{n}_0$, $\bar{n}_1$ be the averages of the number of samples necessary for decision when the signals presented are $s_0$, $s_1$, respectively. If $\bar{n}_0^*$, $\bar{n}_1^*$ are the averages for any other decision procedure based on $x_1, x_2$, etc., with equal probabilities of incorrect response to $s_0$ and $s_1$, then $\bar{n}_0^* \ge \bar{n}_0$ and $\bar{n}_1^* \ge \bar{n}_1$. It is possible that this form of optimality does not appeal to S, who may have to be trained to use it by suitable reward.
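
The sequential procedure described above can be sketched in a few lines of simulation. This is only an illustration under stated assumptions: the densities $p_0(x)$ and $p_1(x)$ are taken to be unit-variance normals with hypothetical means `mu0` and `mu1` (the text leaves them unspecified), and the thresholds use the approximations $A = (1 - \beta)/\alpha$, $B = \beta/(1 - \alpha)$ quoted in Appendix 1. All function names are invented for the example.

```python
import math
import random


def log_likelihood_ratio(x, mu0=0.0, mu1=1.0):
    """c(x) = log p1(x) - log p0(x) for unit-variance normal densities.

    The normal form and the means mu0, mu1 are illustrative assumptions.
    """
    return (mu1 - mu0) * x - 0.5 * (mu1 ** 2 - mu0 ** 2)


def wald_thresholds(alpha, beta):
    """Return (log A, log B) with A = (1 - beta)/alpha, B = beta/(1 - alpha)."""
    return math.log((1.0 - beta) / alpha), math.log(beta / (1.0 - alpha))


def sprt_trial(signal, log_a, log_b, rng, mu0=0.0, mu1=1.0):
    """Simulate one trial; return (decision, n), where decision is 0 or 1."""
    mean = mu1 if signal == 1 else mu0
    total = 0.0
    n = 0
    while True:
        n += 1
        total += log_likelihood_ratio(rng.gauss(mean, 1.0), mu0, mu1)
        if total >= log_a:   # running total reaches log A: decide for s1
            return 1, n
        if total <= log_b:   # running total reaches log B: decide for s0
            return 0, n


def simulate(signal, alpha, beta, trials=2000, seed=0):
    """Return (observed error rate, mean sample size) over repeated trials."""
    rng = random.Random(seed)
    log_a, log_b = wald_thresholds(alpha, beta)
    results = [sprt_trial(signal, log_a, log_b, rng) for _ in range(trials)]
    errors = sum(1 for decision, _ in results if decision != signal)
    mean_n = sum(n for _, n in results) / trials
    return errors / trials, mean_n
```

Calling `simulate(0, 0.05, 0.05)` gives an observed error rate and a mean sample size for $s_0$; multiplying the latter by $t$ gives the corresponding mean decision time under the model.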

Before testing the model, it must be remembered that it is $T$ which is measured and not $T_d$. Even so, a test is available which requires only the following assumption. Consider trials leading to a decision for $s_0$. The assumption is, given the value of $T_d$, that the distribution of $T_i + T_m$ is the same whether the decision is right or wrong. (The same assumption is made for decisions for $s_1$.) This does not exclude the possibility that $T_i + T_m$ and $T_d$ be correlated. The length of time, $T_i$, may affect the uncertainty in the information presented to the computer and therefore may affect $T_d$; alternatively, if $T_d$ is long, $T_m$ may be deliberately shortened. However, it does assume that $T_m$ cannot be influenced by information processed since the initiation of the motor action. In Appendix 1 it is shown that, with mild restrictions on $p_0(x)$ and $p_1(x)$, the distribution of the $n$'s, and therefore of the $T_d$'s, leading to a decision for $s_0$ (or of those leading to $s_1$) is the same whether the decisions are correct or incorrect. With the above assumption, this implies that the same result should hold for a comparison of the correct and incorrect $T$'s leading to $s_0$ (and for a comparison of those leading to $s_1$). This provides the basis of a reasonable test of the model. However, a fair proportion of errors would be needed to give a powerful test.

Without making assumptions about $p_0(x)$ and $p_1(x)$, it is difficult to think of more ways of examining the validity of the model. Since $x$ is an intervening variable without operational definition, it would clearly be unwise to assume much about $p_0(x)$ and $p_1(x)$. However, there is one assumption, called the "condition of symmetry," which in some discrimination tasks may be reasonable. This is that the distribution of $p_1(x)/p_0(x)$, when $x$ is distributed according to $p_0(x)$, is identical with that of $p_0(x)/p_1(x)$, when $x$ is distributed according to $p_1(x)$. It is shown in Appendix 2 that, if this condition holds,

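(2)  $\bar{n}_1/\bar{n}_0 = J(\beta, \alpha)/J(\alpha, \beta)$;

(3)  $J(\alpha, \beta)v_1 - J(\beta, \alpha)v_0 = \{J(\beta, \alpha)\,\alpha(1 - \alpha)[4\bar{n}_1^2 - (\bar{n}_1 - \bar{n}_0)^2] - J(\alpha, \beta)\,\beta(1 - \beta)[4\bar{n}_0^2 - (\bar{n}_0 - \bar{n}_1)^2]\}/(1 - \alpha - \beta)^2$,

where $\alpha$ and $\beta$ are the probabilities of incorrect response to a single $s_0$ and $s_1$, respectively, $v_i$ is the variance of the sample sizes when $s_i$ is presented, and

$J(\alpha, \beta) = \alpha \log[\alpha/(1 - \beta)] + (1 - \alpha) \log[(1 - \alpha)/\beta]$.

If it is feasible to estimate $T_d$ directly for each trial by eliminating $T_i + T_m$ from $T$, then (2) and (3) imply

(4)  $\bar{T}_{d1}/\bar{T}_{d0} = J(\beta, \alpha)/J(\alpha, \beta)$,

(5)  $J(\alpha, \beta)\,\mathrm{var}\,T_{d1} - J(\beta, \alpha)\,\mathrm{var}\,T_{d0} = \{J(\beta, \alpha)\,\alpha(1 - \alpha)[4\bar{T}_{d1}^2 - (\bar{T}_{d1} - \bar{T}_{d0})^2] - J(\alpha, \beta)\,\beta(1 - \beta)[4\bar{T}_{d0}^2 - (\bar{T}_{d0} - \bar{T}_{d1})^2]\}/(1 - \alpha - \beta)^2$.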

Equations (4) and (5) are most relevant if S can be persuaded to achieve different $(\alpha, \beta)$ combinations without changing the distributions $p_0(x)$ and $p_1(x)$. When $\alpha = \beta$, then $\bar{n}_0 = \bar{n}_1$ and $v_0 = v_1$; with the assumptions that $T_i + T_m$ is (i) uncorrelated with $T_d$ and (ii) independent of the signal presented, this implies equality of means and variances of reaction times to the signals. So, for the latter special case, it is not necessary to measure $T_d$.
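
As a small worked illustration of equation (4), the function $J$ and the predicted ratio of mean decision times can be evaluated numerically. The sketch below assumes natural logarithms (the base cancels in the ratio); the function names are hypothetical and chosen only for this example.

```python
import math


def J(a, b):
    """J(a, b) = a log[a / (1 - b)] + (1 - a) log[(1 - a) / b]."""
    return a * math.log(a / (1.0 - b)) + (1.0 - a) * math.log((1.0 - a) / b)


def mean_decision_time_ratio(alpha, beta):
    """Predicted ratio of mean decision times for s1 and s0, per equation (4)."""
    return J(beta, alpha) / J(alpha, beta)
```

With equal error rates, `mean_decision_time_ratio(0.05, 0.05)` is 1, in line with the remark that $\alpha = \beta$ implies equal mean decision times; unequal error rates give a ratio different from 1.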

For the "condition of symmetry" it is sufficient that, with $x$ represented as a number, $p_0(x) = p_1(x - d)$ for some number $d$ with $p_0(x)$ symmetrical about its mean. This might occur when $s_0$, $s_1$ are signals which are close together on some scale and the error added to the signals to make $x$ has the same distribution for each signal. Symmetry would not be expected in absolute threshold discriminations or in the discrimination of widely different colors in a color-noisy background. Another sufficient condition is that $x$ be bivariate, $[x(1), x(2)]$, the probabilities under $s_0$ obtained from those under $s_1$ by interchanging $x(1)$ and $x(2)$. For instance, $x(1)$ and $x(2)$ may be the inputs on two noisy channels and $s_0$ consists of stimulation of the first while $s_1$ consists of stimulation of the second.

A further prediction of the model for the symmetrical case can be made when S is persuaded by a suitable reward to give equal weight to errors to $s_0$ and $s_1$, that is to minimize his unconditional error probability, by adjustment of the constants $A$ and $B$ in his computer. If $p_0$ is the frequency of presentation of $s_0$ then the error probability is $p_0\alpha + (1 - p_0)\beta$ or $e$, say, and the average decision time is $p_0\bar{T}_{d0} + (1 - p_0)\bar{T}_{d1}$ or $\bar{T}_d$, say. It is shown in Appendix 3 that, provided $10e < p_0 < 1 - 10e$, the minimization results in the following relation between $\bar{T}_d$, $e$ and $p_0$:

$\bar{T}_d \propto J(e, e) - J(p_0, p_0)$.
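
The proportionality just stated can be tabulated directly. The following sketch evaluates $J(e, e) - J(p_0, p_0)$ for a given error probability and presentation frequency; the constant of proportionality is not determined by the text, so only relative values are meaningful. The function names are hypothetical.

```python
import math


def J(a, b):
    """J(a, b) = a log[a / (1 - b)] + (1 - a) log[(1 - a) / b]."""
    return a * math.log(a / (1.0 - b)) + (1.0 - a) * math.log((1.0 - a) / b)


def relative_decision_time(e, p0):
    """Quantity to which the mean decision time is proportional.

    Only valid in the stated range 10e < p0 < 1 - 10e.
    """
    if not (10.0 * e < p0 < 1.0 - 10.0 * e):
        raise ValueError("requires 10e < p0 < 1 - 10e")
    return J(e, e) - J(p0, p0)
```

For a fixed $e$, the value is largest at $p_0 = 0.5$ and is the same for $p_0$ and $1 - p_0$, so the predicted mean decision time falls as the presentation frequencies become more unequal.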

The Non-Sequential Fixed-Sample Case

If S has an incentive to react quickly and correctly, then the advantage of the sequential decision procedure is that those discriminations which by chance happen to be easy are made quickly and time is saved. However it is possible that S may adopt a different, less efficient strategy, which is to fix $T_d$ for all trials at a value which will give a certain accepted error rate. Let the sample size corresponding to this decision time be $n$. The likelihood ratio procedures are as follows: decide for $s_0$ if $c(x_1) + \cdots + c(x_n) < \log C$; decide for $s_1$ if $c(x_1) + \cdots + c(x_n) > \log C$; $c(x) = \log p_1(x) - \log p_0(x)$ and $C > 0$. These procedures are optimal in the sense that, if any other procedure based on $x_1, \ldots, x_n$ is used, there exists one of the likelihood ratio procedures with smaller error probabilities. It was remarkable that in the sequential case useful predictions were obtainable under mild restrictions on $p_0(x)$ and $p_1(x)$. Unfortunately this does not hold for the fixed-sample case, making more difficult the problem of testing whether such a model holds.
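
The fixed-sample likelihood ratio rule can be written out as a short sketch. As an assumption made purely for illustration, $p_0(x)$ and $p_1(x)$ are unit-variance normal densities with hypothetical means `mu0` and `mu1`; the text does not specify what happens when the total equals $\log C$ exactly, and the sketch arbitrarily assigns that case to $s_0$.

```python
def c(x, mu0=0.0, mu1=1.0):
    """c(x) = log p1(x) - log p0(x) for unit-variance normals (assumed form)."""
    return (mu1 - mu0) * x - 0.5 * (mu1 ** 2 - mu0 ** 2)


def fixed_sample_decision(samples, log_c=0.0, mu0=0.0, mu1=1.0):
    """Return 1 (decide s1) if the summed c(x) exceeds log C, else 0 (decide s0).

    A total exactly equal to log C is assigned to s0; this tie rule is an
    arbitrary choice made for the sketch.
    """
    total = sum(c(x, mu0, mu1) for x in samples)
    return 1 if total > log_c else 0
```

Raising `log_c` makes decisions for $s_1$ rarer, trading errors to $s_0$ against errors to $s_1$ while the sample size $n$ (and hence $T_d$) stays fixed.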

If  there  is  no  input  storage,  it  is  possible  that  the  results  of  the  self- 
imposed  strategy  just  outlined  are  equivalent  to  those  obtainable  when  the 
experimenter  himself  cuts  off  the  signals  after  an  exposure  time  Td  .  But 
this  is  the  type  of  situation  considered  by  Peterson  and  Birdsall  [2].  The 
emphasis  of  these  authors  is  mainly  on  the  external  parameters  (such  as 
energy)  rather  than  on  any  supposed  intervening  variable.  They  define  a 
set of physical situations for auditory discrimination in terms of a parameter
d,  which  is  equivalent  to  the  difference  between  the  means  of  two  normal 
populations  with  unit  variance.  (For,  in  the  cases  considered,  it  happens 
that  the  logarithm  of  the  likelihood  ratio  of  the  actual  physical  random 
variables  for  the  two  alternatives  is  normally  distributed  with  equality  of 
variance  under  the  two  alternatives.)  This  parameter  sets  a  limit  to  the  various 
performances  (error  probabilities  to  So  and  Si)  of  any  discriminator  using  the 
whole  of  the  physical  information.  It  therefore  sets  an  upper  bound  on  the 
performance  of  S  who  can  only  use  less  than  the  whole.  In  [2]  the  authors 
make  the  assumption  that  the  information  on  the  basis  of  which  S  makes 
his  discrimination  nevertheless  gives  normality  of  logarithm  of  the  likelihood 
ratio. They examine data to see whether S is producing error frequencies
that lie on a curve defined by a d greater than that in the external situation.

More  than  Two  Alternatives 

For $m$ alternatives there are $m$ probability distributions for the intervening variable $x$ (which may be multivariate); that is, signal $s_i$ induces an $x$ with the probability distribution $p_i(x)$ for $i = 1, \ldots, m$. We will consider the consequences of a fixed-sample decision procedure based on $x_1, \ldots, x_n$, where $n$ is fixed.

If the signals are presented independently with probabilities $p_1, \ldots, p_m$ (adding to unity) and if $\alpha_i(\mathcal{D})$ is the probability of error to signal $s_i$ when the decision procedure $\mathcal{D}$ (based on $x_1, \ldots, x_n$) is used, then the probability of error to a single presentation is

$e = \sum_i p_i \alpha_i(\mathcal{D})$.

It is shown in Appendix 4 that the $\mathcal{D}$ minimizing $e$ is that which effectively selects the signal with maximum posterior probability. In this section, this minimum $e$ will be related to $n$ (or $T_d/t$) and $m$ when distributions are normal. However in the validation of the model it might be necessary to supplement $T_d$ with a time $T_a$, representing the time the computer requires to examine the $m$ posterior probabilities to decide which is the largest. For, although it might be reasonable to suppose that $T_i + T_m$ is independent of $m$, one would expect $T_a$ to vary with $m$. The simplest model for $T_a$ would be to suppose that $T_a = (m - 1)t'$, where $t'$ is the time necessary to compare any two of the probabilities and decide which is the larger.

We will state the relation between $n$ and $m$ when $e$ is constant in the following special case (treated by Peterson and Birdsall [3], who stated the relation between $e$ and $m$ when $n$ is held constant by the experimenter): we take $p_1 = p_2 = \cdots = p_m = 1/m$ and $x$ a multivariate random variable $x(1), \ldots, x(m)$. Under $s_i$, suppose that $x(1), \ldots, x(m)$ are independent and that $x(i)$ is normally distributed with mean $\mu > 0$ and unit variance, while the other components of $x$ are normal with zero means and unit variances. Thus there is all-round symmetry. $x(1), \ldots, x(m)$ can be regarded as the inputs on $m$ similar channels. The $i$th channel is stimulated under $s_i$. It is readily seen that the optimal procedure is to choose the signal corresponding to the channel with the largest total. It is shown in Appendix 5 that, with this procedure,

nn'  =  {1  +  [0.64(m  -  l)"'''  -f-  0.45]'} [$"'(1  -  e)  -  ^-\l/m)]' 

for those $m$ for which $e < 1 - (1/m)$. $\Phi^{-1}$ is the inverse of the normal standardized distribution function. The values of $n\mu^2$ for certain values of $e$ and $m$ have been calculated. If $\mu$ is independent of $m$, then $T_d$ is proportional to $n\mu^2$ and the results are plotted in Figure 1. It can be seen that $T_d$ is very nearly linear against $\log m$, which agrees with some experimental findings in this field.
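
The symmetric $m$-channel procedure ("choose the channel with the largest total") lends itself to a direct Monte Carlo check. The sketch below is not the closed-form approximation of the text; it simply simulates the procedure under the stated normal assumptions and estimates the error rate $e$ for given $m$, $n$ and $\mu$. The function name, the default trial count and the seed are hypothetical.

```python
import random


def simulated_error_rate(m, n, mu, trials=2000, seed=0):
    """Estimate e for the rule 'choose the channel with the largest total'.

    By symmetry the stimulated channel is taken to be channel 0: its n inputs
    are N(mu, 1), and the inputs on the other m - 1 channels are N(0, 1).
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        totals = [
            sum(rng.gauss(mu if channel == 0 else 0.0, 1.0) for _ in range(n))
            for channel in range(m)
        ]
        chosen = max(range(m), key=totals.__getitem__)
        if chosen != 0:
            errors += 1
    return errors / trials
```

With $\mu = 0$ the estimate should be close to $1 - 1/m$, the chance level, and it falls toward zero as $\sqrt{n}\,\mu$ grows; holding the estimate fixed while increasing $m$ shows how the required $n$ (and hence $T_d$) must increase.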

Figure 1
The Decision Time ($T_d$) for Error Rate ($e$) and Number of Equally Likely Alternatives ($m$)

The question may be raised whether any $m$-choice task can obey the symmetry condition of the model. Peterson and Birdsall apply the model to the case where an auditory signal is presented in one of four equal periods of an exposure of S to "white" noise. In this case symmetry is superficially present, but any memory difficulties of S would upset it. We would not expect the model to apply to the case of response to one of $m$ fairly easily discriminable lights arranged in some display, for the noise would be highly positional. However, in the case where the lights are patches of white noise on one of which a low intensity visual signal is superimposed so that response is difficult, the positional effect may not be important and there may be symmetry.

Appendix  1 

Let $n_{ij}$ be the sample size for a decision in favor of $s_i$ when $s_j$ is presented. The distribution of $n_{ij}$ is completely determined by its characteristic function, $\psi_{ij}$. From A5.1 of [1], if

$\phi_i(t) = \sum_x p_i(x)[p_1(x)/p_0(x)]^t$,

then

(6)  $(1 - \alpha)B^t\psi_{00}[-\log \phi_0(t)] + \alpha A^t\psi_{10}[-\log \phi_0(t)] \approx 1$,

(7)  $\beta B^t\psi_{01}[-\log \phi_1(t)] + (1 - \beta)A^t\psi_{11}[-\log \phi_1(t)] \approx 1$,

provided the quantities $E_i$, $V_i$ defined in Appendix 2 are small. If $\alpha < 0.1$ and $\beta < 0.1$ then to a good approximation $A = (1 - \beta)/\alpha$ and $B = \beta/(1 - \alpha)$. Now $\phi_0(1 + w) = \phi_1(w)$; so, putting $t = 1 + w$ in (6) and, similarly, $t = w - 1$ in (7) (using $\phi_1(w - 1) = \phi_0(w)$),

$\beta B^w\psi_{00}[-\log \phi_1(w)] + (1 - \beta)A^w\psi_{10}[-\log \phi_1(w)] \approx 1$,

$(1 - \alpha)B^w\psi_{01}[-\log \phi_0(w)] + \alpha A^w\psi_{11}[-\log \phi_0(w)] \approx 1$.

By comparing these equations with (6) and (7), it is found that $\psi_{00} = \psi_{01}$ and $\psi_{10} = \psi_{11}$. Therefore the distributions of $n_{00}$ and $n_{01}$ (and similarly those of $n_{10}$ and $n_{11}$) are identical.

Appendix  2 
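
In the case of symmetry,

$\sum p_0(x) \log[p_0(x)/p_1(x)] = \sum p_1(x) \log[p_1(x)/p_0(x)] = E$,

and

var $\log[p_0(x)/p_1(x)]$ under $p_0(x)$ = var $\log[p_1(x)/p_0(x)]$ under $p_1(x)$ = $V$.

From A:12 of [1], if $E$ and $V$ are small,

(8)  $\bar{n}_0 = J(\alpha, \beta)/E$;  $\bar{n}_1 = J(\beta, \alpha)/E$.

Therefore

$\bar{n}_1/\bar{n}_0 = J(\beta, \alpha)/J(\alpha, \beta)$.

By differentiating (6) twice with respect to $t$ and substituting $t = 0$, using (8) and the fact that $\psi_{ij}$ is the characteristic function of $n_{ij}$,

$v_0 = \dfrac{V J(\alpha, \beta)}{E^3} - \dfrac{\alpha(1 - \alpha)[4\bar{n}_1^2 - (\bar{n}_1 - \bar{n}_0)^2]}{(1 - \alpha - \beta)^2}$.

By symmetry

$v_1 = \dfrac{V J(\beta, \alpha)}{E^3} - \dfrac{\beta(1 - \beta)[4\bar{n}_0^2 - (\bar{n}_0 - \bar{n}_1)^2]}{(1 - \alpha - \beta)^2}$.

Hence

$J(\alpha, \beta)v_1 - J(\beta, \alpha)v_0 = \{J(\beta, \alpha)\,\alpha(1 - \alpha)[4\bar{n}_1^2 - (\bar{n}_1 - \bar{n}_0)^2] - J(\alpha, \beta)\,\beta(1 - \beta)[4\bar{n}_0^2 - (\bar{n}_0 - \bar{n}_1)^2]\}/(1 - \alpha - \beta)^2$.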

Appendix  3 

If $\alpha < 0.1$ and $\beta < 0.1$ then, by (8), $\bar{T}_d \propto p_0 J(\alpha, \beta) + (1 - p_0)J(\beta, \alpha)$. Keeping $e$ [or $p_0\alpha + (1 - p_0)\beta$] constant at a value in the range given by $10e < p_0 < 1 - 10e$, the condition on $\alpha$ and $\beta$ will be satisfied. It is found by the usual methods that the minimum $\bar{T}_d$ is proportional to $J(e, e) - J(p_0, p_0)$.

Appendix  4 

Let $X$ be the set of all possible values of $x = (x_1, \ldots, x_n)$ and $X_i$ the set of $x$ for which a decision is made for $s_i$. Then

$e = \sum_{i=1}^{m} p_i \sum_{x \in X - X_i} p_i(x)$.

Suppose $X_i$ and $X_j$ have a common boundary; then, for $e$ to be a minimum, it will not be changed by small displacements in this boundary. Hence, on the boundary, $p_i p_i(x) = p_j p_j(x)$; that is, the posterior probability of $s_i$ equals that of $s_j$. Considering all possible boundaries, the solution is that $X_i$ is the set of $x$'s for which $s_i$ has greater posterior probability than the other signals.
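
The maximum-posterior rule derived above can be illustrated with a short sketch. It assumes, purely for the example, that each $p_i(x)$ is a unit-variance normal density with a hypothetical mean `means[i]` and that the samples are independent; the comparison is done on log posteriors (up to a common constant) to avoid numerical underflow.

```python
import math


def normal_log_density(x, mean):
    """Log of the unit-variance normal density (an assumed form for p_i)."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)


def max_posterior_decision(samples, priors, means):
    """Return the index i maximizing p_i * product over samples of p_i(x)."""
    scores = []
    for prior, mean in zip(priors, means):
        log_posterior = math.log(prior) + sum(
            normal_log_density(x, mean) for x in samples
        )
        scores.append(log_posterior)
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal priors the rule reduces to choosing the signal of maximum likelihood; unequal priors shift the decision boundaries toward the less frequent signals.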


Appendix 5

Write

$\bar{x}(i) = \sum_{j=1}^{n} x_j(i)/n$.

Then, under $s_1$, $\sqrt{n}\,\bar{x}(1)$ is $N(\sqrt{n}\,\mu, 1)$ and $\sqrt{n}\,\bar{x}(i)$ is $N(0, 1)$ for $i \ne 1$. Therefore,

$\alpha_1(\mathcal{D}) = \cdots = \alpha_m(\mathcal{D}) = 1 - (2\pi)^{-1/2} \int_{-\infty}^{\infty} [\Phi(u)]^{m-1} \exp[-\tfrac{1}{2}(u - \sqrt{n}\,\mu)^2]\,du$.

On integration by parts,

(9)  $e = \sum_i p_i \alpha_i(\mathcal{D}) = (m - 1)(2\pi)^{-1/2} \int_{-\infty}^{\infty} [\Phi(u)]^{m-2}\,\Phi(u - \sqrt{n}\,\mu) \exp(-\tfrac{1}{2}u^2)\,du = e_m(\delta)$,

say, where $\delta = \sqrt{n}\,\mu$. Peterson and Birdsall [3] use this form as the basis of their tabulation. However $e_m(\delta) \to 0$ as $\delta \to \infty$ and $e_m(\delta) \to 1$ as $\delta \to -\infty$; while $e_m'(\delta) < 0$. Therefore $|e_m'(\delta)|$ is a "probability density function" for $\delta$. The characteristic function and hence the distribution of $\delta$ turns out to be the same as that of $v + w$, where $w = \max(v_1, \ldots, v_{m-1})$ and $v, v_1, \ldots, v_{m-1}$ are $m$ independent standard normal variables. Referring to Graph 4.2.2(7) of [4], it can be seen that, for $m < 20$, the first and second moment quotients of $w$ are not very different from those of a normal distribution. Also the addition of $v$ to $w$ will improve normality. Hence $\delta$ is approximately normal, agreeing with the calculations of Peterson and Birdsall. If $\delta$ is $N(\nu, \sigma^2)$, we determine $\nu$ and $\sigma^2$ as follows. From (9), $e_m(0) = 1 - (1/m)$. Also $e_m(0) = 1 - \Phi(-\nu/\sigma)$. Therefore

$\nu/\sigma = -\Phi^{-1}(1/m)$.

Also  0-^  =  var  v  +  var  w  and  from  Graph  4.2.2(6)  of  [4],  var  w  = 
[0.64  (m  -  1)~*  +  0.45]'  for  m  <  20,  which  determines  o•^  Putting  e„{d)  =  e, 
the  constant  error  rate, 

nti"  =  {1  +  [0.64(m  -  ly'''  +  0.45f  }[$-'(!  -  e)  -  *-'(l/m)f. 
STATISTICAL INFERENCE ABOUT MARKOV CHAINS

Summary. Maximum likelihood estimates and their asymptotic distribution are obtained for the transition probabilities in a Markov chain of arbitrary order when there are repeated observations of the chain. Likelihood ratio tests and $\chi^2$-tests of the form used in contingency tables are obtained for testing the following hypotheses: (a) that the transition probabilities of a first order chain are constant, (b) that in case the transition probabilities are constant, they are specified numbers, and (c) that the process is a $u$th order Markov chain against the alternative it is $r$th but not $u$th order. In case $u = 0$ and $r = 1$, case (c) results in tests of the null hypothesis that observations at successive time points are statistically independent against the alternate hypothesis that observations are from a first order Markov chain. Tests of several other hypotheses are also considered. The statistical analysis in the case of a single observation of a long chain is also discussed. There is some discussion of the relation between likelihood ratio criteria and $\chi^2$-tests of the form used in contingency tables.

1.  Introduction.  A  Markov  chain  is  sometimes  a  suitable  probability  model 
for  certain  time  series  in  which  the  observation  at  a  given  time  is  the  category 
into  which  an  individual  falls.  The  simplest  Markov  chain  is  that  in  which 
there are a finite number of states or categories and a finite number of equidistant
time points at which observations are made, the chain is of first-order,
and  the  transition  probabilities  are  the  same  for  each  time  interval.  Such  a 
chain  is  described  by  the  initial  state  and  the  set  of  transition  probabilities; 
namely, the conditional probability of going into each state, given the immediately
preceding state. We shall consider methods of statistical inference
for  this  model  when  there  are  many  observations  in  each  of  the  initial  states 
and  the  same  set  of  transition  probabilities  operate.  For  example,  one  may  wish 
to  estimate  the  transition  probabilities  or  test  hypotheses  about  them.  We  de- 
velop an  asymptotic  theory  for  these  methods  of  inference  when  the  number  of 
observations  increases.  We  shall  also  consider  methods  of  inference  for  more 
general  models,  for  example,  where  the  transition  probabilities  need  not  be  the 
same  for  each  time  interval. 

An  illustration  of  the  use  of  some  of  the  statistical  methods  described  herein 
has  been  given  in  detail  [2].  The  data  for  this  illustration  came  from  a  "panel 
study"  on  vote  intention.  Preceding  the  1940  presidential  election  each  of  a 
number of potential voters was asked his party or candidate preference each
month from May to October (6 interviews). At each interview each person was
classified as Republican, Democrat, or "Don't Know," the latter being a residual
category consisting primarily of people who had not decided on a party or
candidate. One of the null hypotheses in the study was that the probability of
a voter's intention at one interview depended only on his intention at the immediately
preceding interview (first-order case), that such a probability was
constant over time (stationarity), and that the same probabilities hold for all
individuals. It was of interest to see how the data conformed to this null hypothesis,
and also in what specific ways the data differed from this hypothesis.

This  present  paper  develops  and  extends  the  theory  and  the  methods  given 
in  [1]  and  [2].  It  also  presents  some  newer  methods,  which  were  first  mentioned 
in  [9],  that  are  somewhat  different  from  those  given  in  [1]  and  [2],  and  explains 
how to use both the old and new methods for dealing with more general hypotheses.
Some corrections of formulas appearing in [1] and [2] are also given
in  the  present  paper.  An  advantage  of  some  of  the  new  methods  presented 
herein  is  that,  for  many  users  of  these  methods,  their  motivation  and  their 
application  seem  to  be  simpler. 

The problem of the estimation of the transition probabilities, and of the testing
of goodness of fit and the order of the chain has been studied by Bartlett
[3] and Hoel [10] in the situation where only a single sequence of states is observed;
they consider the asymptotic theory as the number of time points
increases. We shall discuss this situation in Section 5 of the present paper, where
a $\chi^2$-test of the form used in contingency tables is given for a hypothesis that is
a generalization of a hypothesis that was considered from the likelihood ratio
point  of  view  by  Hoel  [10]. 

In the present paper, we present both likelihood ratio criteria and $\chi^2$-tests,
and it is shown how these methods are related to some ordinary contingency
table procedures. A discussion of the relation between likelihood ratio tests
and $\chi^2$-tests appears in the final section.

For  further  discussion  of  Markov  chains,  the  reader  is  referred  to  [2]  or  [7]. 

2.  Estimation  of  the  parameters  of  a  first-order  Markov  chain. 

2.1. The model. Let the states be $i = 1, 2, \ldots, m$. Though the state $i$ is
usually thought of as an integer running from 1 to $m$, no actual use is made of
this ordered arrangement, so that $i$ might be, for example, a political party, a
geographical place, a pair of numbers $(a, b)$, etc. Let the times of observation
be $t = 0, 1, \ldots, T$. Let $p_{ij}(t)$ ($i, j = 1, \ldots, m$; $t = 1, \ldots, T$) be the
probability of state $j$ at time $t$, given state $i$ at time $t - 1$. We shall deal both with
(a) stationary transition probabilities (that is, $p_{ij}(t) = p_{ij}$ for $t = 1, \ldots, T$)
and with (b) nonstationary transition probabilities (that is, where the transition
probabilities need not be the same for each time interval). We assume in this
section that there are $n_i(0)$ individuals in state $i$ at $t = 0$. In this section, we
treat the $n_i(0)$ as though they were nonrandom, while in Section 4, we shall
discuss the case where they are random variables. An observation on a given
individual consists of the sequence of states the individual is in at $t = 0, 1, \ldots, T$,
namely $i(0), i(1), i(2), \ldots, i(T)$. Given the initial state $i(0)$, there are $m^T$
possible sequences. These represent mutually exclusive events with probabilities

(2.1)  $p_{i(0)i(1)}\, p_{i(1)i(2)} \cdots p_{i(T-1)i(T)}$

when the transition probabilities are stationary. (When the transition probabilities
are not necessarily stationary, symbols of the form $p_{i(t-1)i(t)}$ should be
replaced by $p_{i(t-1)i(t)}(t)$ throughout.)

Let $n_{ij}(t)$ denote the number of individuals in state $i$ at $t - 1$ and $j$ at $t$.
We shall show that the set of $n_{ij}(t)$ ($i, j = 1, \ldots, m$; $t = 1, \ldots, T$), a set
of $m^2 T$ numbers, form a set of sufficient statistics for the observed sequences.
Let $n_{i(0)i(1)\cdots i(T)}$ be the number of individuals whose sequence of states is $i(0)$,
$i(1), \ldots, i(T)$. Then

(2.2)  $n_{gj}(t) = \sum n_{i(0)i(1)\cdots i(T)},$

where the sum is over all values of the $i$'s with $i(t - 1) = g$ and $i(t) = j$. The
probability, in the nmT dimensional space describing all sequences for all $n$
individuals (for each initial state there are nT dimensions), of a given ordered
set of sequences for the $n$ individuals is

(2.3)
$$
\begin{aligned}
&\prod \left[ p_{i(0)i(1)}(1)\, p_{i(1)i(2)}(2) \cdots p_{i(T-1)i(T)}(T) \right]^{n_{i(0)i(1)\cdots i(T)}} \\
&\quad = \left( \prod \left[ p_{i(0)i(1)}(1) \right]^{n_{i(0)i(1)\cdots i(T)}} \right) \cdots \left( \prod \left[ p_{i(T-1)i(T)}(T) \right]^{n_{i(0)i(1)\cdots i(T)}} \right) \\
&\quad = \left( \prod_{i(0),\,i(1)} p_{i(0)i(1)}(1)^{n_{i(0)i(1)}(1)} \right) \cdots \left( \prod_{i(T-1),\,i(T)} p_{i(T-1)i(T)}(T)^{n_{i(T-1)i(T)}(T)} \right) \\
&\quad = \prod_{t=1}^{T} \prod_{g,j} p_{gj}(t)^{n_{gj}(t)},
\end{aligned}
$$

where the products in the first two lines are over all values of the $T + 1$ indices.
Thus, the set of numbers $n_{ij}(t)$ form a set of sufficient statistics, as announced.
The actual distribution of the $n_{ij}(t)$ is (2.3) multiplied by an appropriate
function of factorials. Let $n_i(t - 1) = \sum_{j=1}^{m} n_{ij}(t)$. Then the conditional
distribution of $n_{ij}(t)$, $j = 1, \ldots, m$, given $n_i(t - 1)$ (or given $n_k(s)$, $k = 1, \ldots, m$;
$s = 0, \ldots, t - 1$) is

(2.4)  $\dfrac{n_i(t-1)!}{\prod_{j=1}^{m} n_{ij}(t)!} \prod_{j=1}^{m} p_{ij}(t)^{n_{ij}(t)}.$

This is the same distribution as one would obtain if one had $n_i(t - 1)$ observations
on a multinomial distribution with probabilities $p_{ij}(t)$ and with resulting
numbers $n_{ij}(t)$. The distribution of the $n_{ij}(t)$ (conditional on the $n_i(0)$) is

(2.5)  $\prod_{t=1}^{T} \prod_{i=1}^{m} \left[ \dfrac{n_i(t-1)!}{\prod_{j=1}^{m} n_{ij}(t)!} \prod_{j=1}^{m} p_{ij}(t)^{n_{ij}(t)} \right].$

For a Markov chain with stationary transition probabilities, a stronger result
concerning sufficiency follows from (2.3); namely, the set $n_{ij} = \sum_{t=1}^{T} n_{ij}(t)$
form a set of sufficient statistics. This follows from the fact that, when the
transition probabilities are stationary, the probability (2.3) can be written
in the form

(2.6)  $\prod_{t=1}^{T} \prod_{g,j} p_{gj}^{n_{gj}(t)} = \prod_{i,j} p_{ij}^{n_{ij}}.$

For not necessarily stationary transition probabilities $p_{ij}(t)$, the $n_{ij}(t)$ are a
minimal set of sufficient statistics.
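
As an illustration of the sufficiency result, the following minimal Python sketch tallies the counts $n_{ij}(t)$ from a set of observed sequences and pools them over $t$ to obtain the $n_{ij}$. It is not part of the original paper: the function names, the coding of states as integers 0, ..., m-1, and the toy data are assumptions made for the example.

```python
def transition_counts(sequences, m):
    """Return counts[t-1][i][j] = n_ij(t) for t = 1..T.

    Assumes every sequence has the same length T + 1 and that states
    are coded as integers 0..m-1 (an assumption of this sketch).
    """
    T = len(sequences[0]) - 1
    counts = [[[0] * m for _ in range(m)] for _ in range(T)]
    for seq in sequences:
        for t in range(1, T + 1):
            counts[t - 1][seq[t - 1]][seq[t]] += 1
    return counts


def pooled_counts(counts):
    """Return n_ij = sum over t of n_ij(t), the statistics used under stationarity."""
    m = len(counts[0])
    return [[sum(tbl[i][j] for tbl in counts) for j in range(m)] for i in range(m)]


# Toy data: five individuals observed at t = 0, 1, 2 (hypothetical values).
sequences = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
n_t = transition_counts(sequences, m=2)
n = pooled_counts(n_t)
```

Any two data sets that yield the same `n_t` have the same likelihood (2.3), which is the sense in which the individual sequences can be discarded once the counts are recorded.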

2.2. Maximum likelihood estimates. The stationary transition probabilities
$p_{ij}$ can be estimated by maximizing the probability (2.6) with respect to the
$p_{ij}$, subject of course to the restrictions $p_{ij} \ge 0$ and

(2.7)  $\sum_{j=1}^{m} p_{ij} = 1, \qquad i = 1, 2, \ldots, m,$

when the $n_{ij}$ are the actual observations. This probability is precisely of the
same form, except for a factor that does not depend on $p_{ij}$, as that obtained
for $m$ independent samples, where the $i$th sample ($i = 1, 2, \ldots, m$) consists of
$n_i^* = \sum_{j=1}^{m} n_{ij}$ multinomial trials with probabilities $p_{ij}$ ($i, j = 1, 2, \ldots, m$). For
such samples, it is well-known and easily verified that the maximum likelihood
estimates for $p_{ij}$ are

(2.8)  $\hat p_{ij} = n_{ij} / n_i^* = \sum_{t=1}^{T} n_{ij}(t) \Big/ \sum_{k=1}^{m} \sum_{t=1}^{T} n_{ik}(t),$

and hence this is also true for any other distribution in which the elementary
probability is of the same form except for parameter-free factors, and the
restrictions on the $p_{ij}$ are the same. In particular, it applies to the estimation of
the parameters $p_{ij}$ in (2.6).

When the transition probabilities are not necessarily stationary, the general
approach used in the preceding paragraph can still be applied, and the maximum
likelihood estimates for the $p_{ij}(t)$ are found to be

(2.9)  $\hat p_{ij}(t) = n_{ij}(t) / n_i(t-1) = n_{ij}(t) \Big/ \sum_{k=1}^{m} n_{ik}(t).$

The same maximum likelihood estimates for the $p_{ij}(t)$ are obtained when we
consider the conditional distribution of $n_{ij}(t)$ given $n_i(t - 1)$ as when the joint
distribution of the $n_{ij}(1), n_{ij}(2), \ldots, n_{ij}(T)$ is used. Formally these estimates
are the same as one would obtain if for each $i$ and $t$ one had $n_i(t - 1)$ observations
on a multinomial distribution with probabilities $p_{ij}(t)$ and with resulting
numbers $n_{ij}(t)$.

The estimates can be described in the following way: Let the entries $n_{ij}(t)$
for given $t$ be entered in a two-way $m \times m$ table. The estimate of $p_{ij}(t)$ is the $i$,
$j$th entry in the table divided by the sum of the entries in the $i$th row. In order
to estimate $p_{ij}$ for a stationary chain, add the corresponding entries in the two-way
tables for $t = 1, \ldots, T$, obtaining a two-way table with entries $n_{ij} =
\sum_t n_{ij}(t)$. The estimate of $p_{ij}$ is the $i$, $j$th entry of the table of $n_{ij}$'s divided by
the sum of the entries in the $i$th row.
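
The table description above translates directly into a row-normalization step. The sketch below is an illustration only: the function names and the counts are hypothetical, and the code assumes the counts are already arranged as one $m \times m$ table per time point.

```python
def row_normalize(table):
    """Divide each entry by its row sum; rows with no observations give NaN."""
    estimates = []
    for row in table:
        total = sum(row)
        estimates.append([x / total if total else float("nan") for x in row])
    return estimates


def mle_nonstationary(counts):
    """Estimates of p_ij(t) as in (2.9): one normalized table per time point."""
    return [row_normalize(tbl) for tbl in counts]


def mle_stationary(counts):
    """Estimates of p_ij as in (2.8): pool the tables over t, then normalize rows."""
    m = len(counts[0])
    pooled = [[sum(tbl[i][j] for tbl in counts) for j in range(m)] for i in range(m)]
    return row_normalize(pooled)


# Hypothetical counts n_ij(t) for m = 2 states and T = 2 transitions.
counts = [
    [[30, 10], [5, 15]],   # n_ij(1)
    [[20, 15], [10, 15]],  # n_ij(2)
]
p_t = mle_nonstationary(counts)
p = mle_stationary(counts)
```

In this toy example the pooled table is [[50, 25], [15, 30]], so the stationary estimates for the first row are 50/75 and 25/75.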

The  covariance  structure  of  the  maximum  likelihood  estimates  presented  in 
this  section  will  be  given  further  on. 

2.3. Asymptotic behavior of $n_{ij}(t)$. To find the asymptotic behavior of the
$\hat p_{ij}$, first consider the $n_{ij}(t)$. We shall assume that $n_k(0) / \sum_j n_j(0) \to \eta_k$
($\eta_k > 0$, $\sum_k \eta_k = 1$) as $\sum_j n_j(0) \to \infty$. For each $i(0)$, the set $n_{i(0)i(1)\cdots i(T)}$ are
simply multinomial variables with sample size $n_{i(0)}(0)$ and parameters
$p_{i(0)i(1)}\, p_{i(1)i(2)} \cdots p_{i(T-1)i(T)}$, and hence are asymptotically normally distributed
as the sample size increases. The $n_{ij}(t)$ are linear combinations of these
multinomial variables, and hence are also asymptotically normally distributed.

Let $P = (p_{ij})$ and let $p_{ij}^{(t)}$ be the elements of the matrix $P^t$. Then $p_{ij}^{(t)}$ is the
probability of state $j$ at time $t$ given state $i$ at time 0. Let $n_{k;ij}(t)$ be the number
of sequences including state $k$ at time 0, $i$ at time $t - 1$ and $j$ at time $t$. Then
we seek the low order moments of

(2.10)  $n_{ij}(t) = \sum_{k=1}^{m} n_{k;ij}(t).$

The probability associated with $n_{k;ij}(t)$ is $p_{ki}^{(t-1)} p_{ij}$ with a sample size of $n_k(0)$.
Thus

(2.11)  $\mathcal{E}\, n_{k;ij}(t) = n_k(0)\, p_{ki}^{(t-1)} p_{ij},$

(2.12)  $\operatorname{Var}\{n_{k;ij}(t)\} = n_k(0)\, p_{ki}^{(t-1)} p_{ij} \left[ 1 - p_{ki}^{(t-1)} p_{ij} \right],$

(2.13)  $\operatorname{Cov}\{n_{k;ij}(t),\, n_{k;gh}(t)\} = -n_k(0)\, p_{ki}^{(t-1)} p_{ij}\, p_{kg}^{(t-1)} p_{gh}, \qquad (i, j) \ne (g, h),$

since the set of $n_{k;ij}(t)$ follows a multinomial distribution. Covariances between
other variables were given in [1].
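
The moments (2.11) and (2.12) can be evaluated numerically once $P$ and the initial counts are given. The following sketch is illustrative only; the helper names, the transition matrix, and the initial counts are assumptions chosen for the example, and plain nested lists are used in place of a matrix library.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(m)] for i in range(m)]


def mat_pow(p, e):
    """P raised to a non-negative integer power e (P^0 is the identity)."""
    m = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for _ in range(e):
        result = mat_mul(result, p)
    return result


def expected_counts(n0, p, t):
    """E n_ij(t) = sum_k n_k(0) p_ki^(t-1) p_ij, from (2.10) and (2.11)."""
    m = len(p)
    pt = mat_pow(p, t - 1)
    occupancy = [sum(n0[k] * pt[k][i] for k in range(m)) for i in range(m)]
    return [[occupancy[i] * p[i][j] for j in range(m)] for i in range(m)]


def variance_single_origin(n_k0, p, k, i, j, t):
    """Var n_{k;ij}(t) as in (2.12), with q = p_ki^(t-1) p_ij."""
    q = mat_pow(p, t - 1)[k][i] * p[i][j]
    return n_k0 * q * (1.0 - q)


# Hypothetical stationary chain with m = 2 states (0-based indices in code).
n0 = [60, 40]
P = [[0.8, 0.2], [0.4, 0.6]]
e1 = expected_counts(n0, P, 1)
e2 = expected_counts(n0, P, 2)
v = variance_single_origin(n0[0], P, k=0, i=0, j=1, t=1)
```

The entries of each expected table sum to the total number of individuals, here 100, because every individual contributes exactly one transition at each time point.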

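Let us now examine moments of $n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}$, where $n_{k;i}(t-1) =
\sum_j n_{k;ij}(t)$; they will be needed in obtaining the asymptotic theory for test
procedures. The conditional distribution of $n_{k;ij}(t)$ given $n_{k;i}(t-1)$ is easily
seen to be multinomial, with the probabilities $p_{ij}$. Thus,

(2.14)  $\mathcal{E}\{n_{k;ij}(t) \mid n_{k;i}(t-1)\} = p_{ij}\, n_{k;i}(t-1),$

(2.15)  $\mathcal{E}[n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}] = \mathcal{E}\,\mathcal{E}\{[n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}] \mid n_{k;i}(t-1)\} = 0.$

The variance of this quantity is

(2.16)
$$
\begin{aligned}
\mathcal{E}[n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}]^2
&= \mathcal{E}\,\mathcal{E}\{[n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}]^2 \mid n_{k;i}(t-1)\} \\
&= \mathcal{E}\, n_{k;i}(t-1)\, p_{ij}(1 - p_{ij}) \\
&= n_k(0)\, p_{ki}^{(t-1)}\, p_{ij}(1 - p_{ij}).
\end{aligned}
$$

The covariances of pairs of such quantities are

(2.17)
$$
\begin{aligned}
&\mathcal{E}[n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}][n_{k;ih}(t) - n_{k;i}(t-1)\,p_{ih}] \\
&\quad = \mathcal{E}\,\mathcal{E}\{[n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}][n_{k;ih}(t) - n_{k;i}(t-1)\,p_{ih}] \mid n_{k;i}(t-1)\} \\
&\quad = \mathcal{E}[-n_{k;i}(t-1)\, p_{ij}\, p_{ih}] = -n_k(0)\, p_{ki}^{(t-1)}\, p_{ij}\, p_{ih}, \qquad j \ne h,
\end{aligned}
$$

(2.18)
$$
\begin{aligned}
&\mathcal{E}[n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}][n_{k;gh}(t) - n_{k;g}(t-1)\,p_{gh}] \\
&\quad = \mathcal{E}\,\mathcal{E}\{[n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}][n_{k;gh}(t) - n_{k;g}(t-1)\,p_{gh}] \mid n_{k;i}(t-1),\, n_{k;g}(t-1)\} \\
&\quad = 0, \qquad i \ne g,
\end{aligned}
$$

(2.19)
$$
\begin{aligned}
&\mathcal{E}[n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}][n_{k;gh}(t+r) - n_{k;g}(t+r-1)\,p_{gh}] \\
&\quad = \mathcal{E}\,\mathcal{E}\{[n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}][n_{k;gh}(t+r) - n_{k;g}(t+r-1)\,p_{gh}] \mid n_{k;g}(t+r-1),\, n_{k;i}(t-1),\, n_{k;ij}(t)\} \\
&\quad = 0, \qquad r > 0.
\end{aligned}
$$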

To summarize, the random variables $n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}$ for $j = 1, \ldots, m$
have means 0 and variances and covariances of multinomial variables with
probabilities $p_{ij}$ and sample size $n_k(0)\,p_{ki}^{(t-1)}$. The variables $n_{k;ij}(t) - n_{k;i}(t-1)\,p_{ij}$
and $n_{k;gh}(s) - n_{k;g}(s-1)\,p_{gh}$ are uncorrelated if $t \ne s$ or $i \ne g$.

Since we assume $n_k(0)$ fixed, $n_{k;ij}(t)$ and $n_{l;gh}(t)$ are independent if $k \ne l$.
Thus

(2.20)  $\mathcal{E}[n_{ij}(t) - n_i(t-1)\,p_{ij}] = 0,$

(2.21)  $\mathcal{E}[n_{ij}(t) - n_i(t-1)\,p_{ij}]^2 = \sum_{k=1}^{m} n_k(0)\, p_{ki}^{(t-1)}\, p_{ij}(1 - p_{ij}),$

(2.22)  $\mathcal{E}[n_{ij}(t) - n_i(t-1)\,p_{ij}][n_{ih}(t) - n_i(t-1)\,p_{ih}] = -\sum_{k=1}^{m} n_k(0)\, p_{ki}^{(t-1)}\, p_{ij}\, p_{ih}, \qquad j \ne h,$

(2.23)  $\mathcal{E}[n_{ij}(t) - n_i(t-1)\,p_{ij}][n_{gh}(s) - n_g(s-1)\,p_{gh}] = 0, \qquad t \ne s \text{ or } i \ne g.$

2.4. The asymptotic distribution of the estimates. It will now be shown that
when $n \to \infty$,

(2.24)
$$
\begin{aligned}
\sqrt{n}\,(\hat p_{ij} - p_{ij})
&= \sqrt{n} \left[ \frac{\sum_{t=1}^{T} n_{ij}(t)}{\sum_{t=1}^{T} n_i(t-1)} - p_{ij} \right] \\
&= \sqrt{n}\, \frac{\sum_{t=1}^{T} [n_{ij}(t) - p_{ij}\, n_i(t-1)]}{\sum_{t=1}^{T} n_i(t-1)} \\
&= \sqrt{n}\, \frac{\sum_{k=1}^{m} \sum_{t=1}^{T} [n_{k;ij}(t) - p_{ij}\, n_{k;i}(t-1)]}{\sum_{t=1}^{T} n_i(t-1)}
\end{aligned}
$$


has a limiting normal distribution, and the means, variances and covariances
of the limiting distribution will be found. Because $n_{k;ij}(t)$ is a multinomial
variable, we know that

(2.25)  $n_{k;ij}(t)/n = [n_{k;ij}(t)/n_k(0)]\,[n_k(0)/n]$

converges in probability to its expected value when $n_k(0)/n \to \eta_k$. Thus

(2.26)
$$
\operatorname*{p\,lim}_{n \to \infty} \frac{1}{n} \sum_{t=1}^{T} n_i(t-1)
= \operatorname*{p\,lim}_{n \to \infty} \frac{1}{n} \sum_{k=1}^{m} \sum_{t=1}^{T} n_{k;i}(t-1)
= \sum_{k=1}^{m} \eta_k \sum_{t=1}^{T} p_{ki}^{(t-1)}.
$$

Therefore $n^{1/2}(\hat p_{ij} - p_{ij})$ has the same limit distribution as

(2.27)  $\dfrac{\sum_{t=1}^{T} [n_{ij}(t) - p_{ij}\, n_i(t-1)] / n^{1/2}}{\sum_{k=1}^{m} \sum_{t=1}^{T} \eta_k\, p_{ki}^{(t-1)}}$

(see p. 254 in [6]).

From the conclusions in Section 2.3, the numerator of (2.27) has mean 0 and
variance

(2.28)  $\mathcal{E}\left\{ \sum_{t=1}^{T} [n_{ij}(t) - p_{ij}\, n_i(t-1)] \right\}^2 \Big/ n = \sum_{k=1}^{m} \sum_{t=1}^{T} n_k(0)\, p_{ki}^{(t-1)}\, p_{ij}(1 - p_{ij}) / n.$

The covariance between two different numerators is

(2.29)
$$
\mathcal{E}\left\{ \sum_{t=1}^{T} [n_{ij}(t) - p_{ij}\, n_i(t-1)] \right\} \left\{ \sum_{t=1}^{T} [n_{gh}(t) - p_{gh}\, n_g(t-1)] \right\} \Big/ n
= -\delta_{ig} \sum_{k=1}^{m} \sum_{t=1}^{T} n_k(0)\, p_{ki}^{(t-1)}\, p_{ij}\, p_{gh} / n,
$$

where $\delta_{ig} = 0$ if $i \ne g$ and $\delta_{ii} = 1$.

Let

(2.30)  $\sum_{k=1}^{m} \sum_{t=1}^{T} \eta_k\, p_{ki}^{(t-1)} = \phi_i.$

Then the limiting variance of the numerator of (2.27) is $\phi_i\, p_{ij}(1 - p_{ij})$, and
the limiting covariance between two different numerators is $-\delta_{ig}\, \phi_i\, p_{ij}\, p_{gh}$.
Because the numerators of (2.27) are linear combinations of normalized
multinomial variables, with fixed probabilities and increasing sample size, they have
a limiting normal distribution and the variances and covariances of this limit
distribution are the limits of the respective variances and covariances (see, e.g.,
Theorem 2, p. 5 in [4]).

Since $n^{1/2}(\hat p_{ij} - p_{ij})$ has the same limit distribution as (2.27), the variables
$n^{1/2}(\hat p_{ij} - p_{ij})$ have a limiting joint normal distribution with means 0, variances
$p_{ij}(1 - p_{ij})/\phi_i$ and the covariances $-\delta_{ig}\, p_{ij}\, p_{gh}/\phi_i$. The variables $(n\phi_i)^{1/2}(\hat p_{ij} - p_{ij})$
have a limiting joint normal distribution with means 0, variances $p_{ij}(1 - p_{ij})$
and covariances $-\delta_{ig}\, p_{ij}\, p_{gh}$. Also, the set $(n_i^*)^{1/2}(\hat p_{ij} - p_{ij})$ has a limiting
joint normal distribution with means 0, variances $p_{ij}(1 - p_{ij})$ and covariances
$-\delta_{ig}\, p_{ij}\, p_{gh}$, where $n_i^* = \sum_{t=0}^{T-1} n_i(t)$.

In other terms, the set $(n\phi_i)^{1/2}(\hat p_{ij} - p_{ij})$ for a given $i$ has the same limiting
distribution as the estimates of multinomial probabilities $p_{ij}$ with sample size
$n\phi_i$, which is the expected total number of observations $n_i^*$ in the $i$th state for
$t = 0, \ldots, T - 1$. The variables $(n\phi_i)^{1/2}(\hat p_{ij} - p_{ij})$ for $m$ different values of $i$
($i = 1, 2, \ldots, m$) are asymptotically independent (i.e., the limiting joint
distribution factors), and hence have the same limiting joint distribution as
obtained from similar functions of the estimates of multinomial probabilities
$p_{ij}$ from $m$ independent samples with sample sizes $n\phi_i$ ($i = 1, 2, \ldots, m$). It
will often be possible to reformulate hypotheses about the $p_{ij}$ in terms of $m$
independent samples consisting of multinomial trials.
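
To make the role of $\phi_i$ in (2.30) concrete, the sketch below computes $\phi_i$ and the limiting variances $p_{ij}(1 - p_{ij})/\phi_i$ of $n^{1/2}(\hat p_{ij} - p_{ij})$ for a given stationary chain. It is an illustration under assumed inputs: the function names, the matrix, and the initial proportions $\eta_k$ are hypothetical, and every $\phi_i$ is assumed to be positive.

```python
def phi(eta, p, T):
    """phi_i = sum_k sum_{t=1..T} eta_k p_ki^(t-1), as in (2.30).

    The state distribution is propagated one step at a time instead of
    forming the matrix powers explicitly.
    """
    m = len(p)
    dist = list(eta)          # distribution over states at time t - 1
    total = [0.0] * m
    for _ in range(T):
        for i in range(m):
            total[i] += dist[i]
        dist = [sum(dist[k] * p[k][i] for k in range(m)) for i in range(m)]
    return total


def asymptotic_variances(eta, p, T):
    """Limiting variance of sqrt(n) * (p_hat_ij - p_ij): p_ij (1 - p_ij) / phi_i."""
    ph = phi(eta, p, T)
    m = len(p)
    return [[p[i][j] * (1.0 - p[i][j]) / ph[i] for j in range(m)] for i in range(m)]


# Hypothetical inputs: initial proportions eta and a 2-state transition matrix.
eta = [0.6, 0.4]
P = [[0.8, 0.2], [0.4, 0.6]]
ph = phi(eta, P, T=2)
var = asymptotic_variances(eta, P, T=2)
```

Since each time point contributes total mass one, the $\phi_i$ sum to $T$; states that are visited more often have larger $\phi_i$ and hence more precisely estimated rows.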

We shall also make use of the fact that the variables $\hat p_{ij}(t) = n_{ij}(t)/n_i(t - 1)$
for a given $i$ and $t$ have the same asymptotic distribution as the estimates of
multinomial probabilities with sample sizes $\mathcal{E}\, n_i(t - 1)$, and the variables $\hat p_{ij}(t)$
for two different values of $i$ or two different values of $t$ are asymptotically
independent. This fact can be proved by methods similar to those used earlier in
this section. Hence, in testing hypotheses concerning the $p_{ij}(t)$ it will sometimes
be possible to reformulate the hypotheses in terms of $m \times T$ independent
samples consisting of multinomial trials, and standard test procedures may then
be applied.

3.  Tests  of  hypotheses  and  confidence  regions. 

3.1.  Tests  of  hypotheses  about  specific  probabilities  and  confidence  regions. 

On  the  basis  of  the  asymptotic  distribution  theory  in  the  preceding  section,  we 
can  derive  certain  methods  of  statistical  inference.  Here  we  shall  assume  that 
every  pij  >  0. 

First we consider testing the hypothesis that certain transition probabilities
$p_{ij}$ have specified values $p_{ij}^0$. We make use of the fact that under the null
hypothesis the $(n_i^*)^{1/2}(\hat p_{ij} - p_{ij}^0)$ have a limiting normal distribution with means
zero, and variances and covariances depending on $p_{ij}^0$ in the same way as obtains
for multinomial estimates. We can use standard asymptotic theory for
multinomial or normal distributions to test a hypothesis about one or more
$p_{ij}$, or determine a confidence region for one or more $p_{ij}$.

As a specific example consider testing the hypothesis that $p_{ij} = p_{ij}^0$, $j =
1, \ldots, m$, for a given $i$. Under the null hypothesis,

(3.1)  $\sum_{j=1}^{m} n_i^*\, \dfrac{(\hat p_{ij} - p_{ij}^0)^2}{p_{ij}^0}$

has an asymptotic $\chi^2$-distribution with $m - 1$ degrees of freedom (according to
the usual asymptotic theory of multinomial variables). Thus the critical region
of one test of this hypothesis at significance level $\alpha$ consists of the set $\hat p_{ij}$ for
which (3.1) is greater than the $\alpha$ significance point of the $\chi^2$-distribution with
$m - 1$ degrees of freedom. A confidence region of confidence coefficient $\alpha$
consists of the set $p_{ij}^0$ for which (3.1) is less than the $\alpha$ significance point. (The $p_{ij}^0$
in the denominator can be replaced by $\hat p_{ij}$.) Since the variables $n_i^*(\hat p_{ij} - p_{ij}^0)^2$
for different $i$ are asymptotically independent, the forms (3.1) for different $i$ are
asymptotically independent, and hence can be added to obtain other $\chi^2$-variables.
For instance a test for all $p_{ij}$ ($i, j = 1, 2, \ldots, m$) can be obtained by adding
(3.1) over all $i$, resulting in a $\chi^2$-variable with $m(m - 1)$ degrees of freedom.

The use of the $\chi^2$-test of goodness of fit is discussed in [5]. We believe that
there is as good reason for adopting the tests, which are analogous to $\chi^2$-tests
of goodness of fit, described in this section as in the situation from which they
were borrowed (see [5]).
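
The statistic (3.1) and its sum over $i$ can be computed from the pooled counts $n_{ij}$ alone. The sketch below is an illustration, not the authors' procedure in code form: the function name, the counts, and the hypothesized matrix are assumptions, and the $\chi^2$ significance point must still be looked up separately (for 2 degrees of freedom the 5 per cent point is about 5.99).

```python
def chi_square_specified(n, p0):
    """Return (per-row statistics (3.1), their sum, degrees of freedom of the sum).

    n  : pooled transition counts n_ij (m x m nested list)
    p0 : hypothesized transition probabilities p_ij^0, all assumed positive
    """
    m = len(n)
    per_row = []
    for i in range(m):
        n_star = sum(n[i])                      # n_i^*
        stat = 0.0
        for j in range(m):
            p_hat = n[i][j] / n_star            # estimate (2.8)
            stat += n_star * (p_hat - p0[i][j]) ** 2 / p0[i][j]
        per_row.append(stat)
    return per_row, sum(per_row), m * (m - 1)


# Hypothetical pooled counts and null values for m = 2 states.
n = [[50, 25], [15, 30]]
p0 = [[0.6, 0.4], [0.4, 0.6]]
per_row, total, df = chi_square_specified(n, p0)
```

Each per-row statistic has $m - 1$ degrees of freedom and can be examined on its own to see which rows of the hypothesized matrix fit poorly.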

3.2. Testing the hypothesis that the transition probabilities are constant.

In the stationary Markov chain, $p_{ij}$ is the probability that an individual in
state $i$ at time $t - 1$ moves to state $j$ at $t$. A general alternative to this
assumption is that the transition probability depends on $t$; let us say it is $p_{ij}(t)$. We test
the null hypothesis $H$: $p_{ij}(t) = p_{ij}$ ($t = 1, \ldots, T$). Under the alternate
hypothesis, the estimates of the transition probabilities for time $t$ are

(3.2)  $\hat p_{ij}(t) = \dfrac{n_{ij}(t)}{n_i(t-1)}.$

The likelihood function maximized under the null hypothesis is

(3.3)  $\prod_{t=1}^{T} \prod_{i,j} \hat p_{ij}^{\,n_{ij}(t)}.$

The likelihood function maximized under the alternative is

(3.4)  $\prod_{t} \prod_{i,j} \hat p_{ij}(t)^{n_{ij}(t)}.$

The ratio is the likelihood ratio criterion

(3.5)  $\lambda = \prod_{t} \prod_{i,j} \left[ \dfrac{\hat p_{ij}}{\hat p_{ij}(t)} \right]^{n_{ij}(t)}.$

A slight extension of a theorem of Cramér [6] or of Neyman [11] shows that
$-2 \log \lambda$ is distributed as $\chi^2$ with $(T - 1)[m(m - 1)]$ degrees of freedom when
the null hypothesis is true.

The likelihood ratio (3.5) resembles likelihood ratios obtained for standard
tests of homogeneity in contingency tables (see [6], p. 445). We shall now
develop further this similarity to usual procedures for contingency tables. A proof
that the results obtained by this contingency table approach are asymptotically
equivalent to those presented earlier in this section will be given in Section 6.

For a given $i$, the set $\hat p_{ij}(t)$ has the same asymptotic distribution as the
estimates of multinomial probabilities $p_{ij}(t)$ for $T$ independent samples. An $m \times T$
table, which has the same formal appearance as a contingency table, can be
used to represent the joint estimates $\hat p_{ij}(t)$ for a given $i$ and for $j = 1, 2, \ldots, m$
and $t = 1, 2, \ldots, T$.


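              $j = 1$              $j = 2$              ...    $j = m$

  $t = 1$     $\hat p_{i1}(1)$     $\hat p_{i2}(1)$     ...    $\hat p_{im}(1)$

  $t = 2$     $\hat p_{i1}(2)$     $\hat p_{i2}(2)$     ...    $\hat p_{im}(2)$

  ...

  $t = T$     $\hat p_{i1}(T)$     $\hat p_{i2}(T)$     ...    $\hat p_{im}(T)$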

The hypothesis of interest is that the random variables represented by the $T$
rows have the same distribution, so that the data are homogeneous in this
respect. This is equivalent to the hypothesis that there are $m$ constants $p_{i1}$,
$p_{i2}, \ldots, p_{im}$, with $\sum_j p_{ij} = 1$, such that the probability associated with the
$j$th column is equal to $p_{ij}$ in all $T$ rows; that is, $p_{ij}(t) = p_{ij}$ for $t = 1, 2, \ldots, T$.
The $\chi^2$-test of homogeneity seems appropriate here ([6], p. 445); that is, in order
to test this hypothesis, we calculate

(3.6)  $\chi_i^2 = \sum_{t,j} n_i(t-1)\, [\hat p_{ij}(t) - \hat p_{ij}]^2 / \hat p_{ij};$

if the null hypothesis is true, $\chi_i^2$ has the usual limiting distribution with $(m - 1)
(T - 1)$ degrees of freedom.

Another test of the hypothesis of homogeneity for $T$ independent samples
from multinomial trials can be obtained by use of the likelihood ratio criterion;
that is, in order to test this hypothesis for the data given in the $m \times T$ table,
calculate

(3.7)  $\lambda_i = \prod_{t,j} \left[ \hat p_{ij} / \hat p_{ij}(t) \right]^{n_{ij}(t)},$

which is formally similar to the likelihood ratio criterion. The asymptotic
distribution of $-2 \log \lambda_i$ is $\chi^2$ with $(m - 1)(T - 1)$ degrees of freedom.

The preceding remarks relating to the contingency table approach dealt
with a given value of $i$. Hence, the hypothesis can be tested separately for each
value of $i$.

Let us now consider the joint hypothesis that $p_{ij}(t) = p_{ij}$ for all $i = 1, 2, \ldots, m$,
$j = 1, 2, \ldots, m$, $t = 1, \ldots, T$. A test of this joint null hypothesis follows
directly from the fact that the random variables $\hat p_{ij}(t)$ and $\hat p_{ij}$ for two different
values of $i$ are asymptotically independent. Hence, under the null hypothesis,
the set of $\chi_i^2$ calculated for each $i = 1, 2, \ldots, m$ are asymptotically independent,
and the sum

(3.8)  $\chi^2 = \sum_{i=1}^{m} \chi_i^2 = \sum_{i} \sum_{t,j} n_i(t-1)\, [\hat p_{ij}(t) - \hat p_{ij}]^2 / \hat p_{ij}$

has the usual limiting distribution with $m(m - 1)(T - 1)$ degrees of freedom.
Similarly, the test criterion based on (3.5) can be written

(3.9)  $\sum_{i=1}^{m} -2 \log \lambda_i = -2 \log \lambda.$
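
Both criteria for the hypothesis of constant transition probabilities can be computed from the tables of $n_{ij}(t)$. The following sketch evaluates the sum (3.8) and $-2 \log \lambda$ from (3.5). It is illustrative only: the function name and the counts are hypothetical, and the code assumes that every state is occupied at every time point and that every pooled count $n_{ij}$ is positive.

```python
import math


def stationarity_tests(counts):
    """Return (chi2 from (3.8), -2 log lambda from (3.5), degrees of freedom).

    counts[t-1][i][j] holds n_ij(t) for t = 1..T.
    """
    T = len(counts)
    m = len(counts[0])
    chi2 = 0.0
    minus2_log_lambda = 0.0
    for i in range(m):
        row_tot = [sum(counts[t][i]) for t in range(T)]    # n_i(t-1)
        n_star = sum(row_tot)                              # n_i^*
        for j in range(m):
            p_ij = sum(counts[t][i][j] for t in range(T)) / n_star   # pooled estimate
            for t in range(T):
                n_ijt = counts[t][i][j]
                p_ijt = n_ijt / row_tot[t]                 # estimate (3.2)
                chi2 += row_tot[t] * (p_ijt - p_ij) ** 2 / p_ij
                if n_ijt > 0:                              # a zero count contributes 0
                    minus2_log_lambda += 2.0 * n_ijt * math.log(p_ijt / p_ij)
    return chi2, minus2_log_lambda, m * (m - 1) * (T - 1)


# Hypothetical counts for m = 2 states and T = 2 transitions.
counts = [
    [[30, 10], [5, 15]],   # n_ij(1)
    [[20, 15], [10, 15]],  # n_ij(2)
]
chi2, g2, df = stationarity_tests(counts)

# Tables with identical row proportions at every t give both statistics equal to 0.
same = [[[10, 10], [5, 15]], [[10, 10], [5, 15]]]
chi2_same, g2_same, _ = stationarity_tests(same)
```

The two statistics are referred to the same $\chi^2$ distribution; they generally differ in finite samples while agreeing asymptotically under the null hypothesis.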

3.3. Test of the hypothesis that the chain is of a given order. Consider first a
second-order Markov chain. Given that an individual is in state $i$ at $t - 2$ and
in $j$ at $t - 1$, let $p_{ijk}(t)$ ($i, j, k = 1, \ldots, m$; $t = 2, 3, \ldots, T$) be the probability
of being in state $k$ at $t$. When the second-order chain is stationary, $p_{ijk}(t) =
p_{ijk}$ for $t = 2, \ldots, T$. A first-order stationary chain is a special second-order
chain, one for which $p_{ijk}(t)$ does not depend on $i$. On the other hand, as is well-known,
the second-order chain can be represented as a more complicated first-order
chain (see, e.g. [2]). To do this, let the pair of successive states $i$ and $j$
define a composite state $(i, j)$. Then the probability of the composite state
$(j, k)$ at $t$ given the composite state $(i, j)$ at $t - 1$ is $p_{ijk}(t)$. Of course, the
probability of state $(h, k)$, $h \ne j$, given $(i, j)$, is zero. The composite states are easily
seen to form a chain with $m^2$ states and with certain transition probabilities 0.
This representation is useful because some of the results for first-order Markov
chains can be carried over from Section 2.
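
The composite-state construction can be written out explicitly. The sketch below builds the $m^2 \times m^2$ first-order transition matrix from stationary second-order probabilities $p_{ijk}$. The function name, the indexing of the composite state $(i, j)$ as `i * m + j`, and the numerical values are assumptions made for the illustration.

```python
def composite_chain(p2):
    """Build the first-order matrix on composite states (i, j).

    p2[i][j][k] is the second-order probability p_ijk. Composite state (i, j)
    is stored at index i * m + j (a convention of this sketch). The entry from
    (i, j) to (j, k) is p_ijk; entries to (h, k) with h != j stay at zero.
    """
    m = len(p2)
    size = m * m
    q = [[0.0] * size for _ in range(size)]
    for i in range(m):
        for j in range(m):
            for k in range(m):
                q[i * m + j][j * m + k] = p2[i][j][k]
    return q


# Hypothetical second-order probabilities for m = 2 states.
p2 = [
    [[0.9, 0.1], [0.3, 0.7]],   # p_00k, p_01k
    [[0.6, 0.4], [0.2, 0.8]],   # p_10k, p_11k
]
Q = composite_chain(p2)
```

Each row of `Q` has at most $m$ nonzero entries, which is the structure referred to above as "certain transition probabilities 0".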

Now let $n_{ijk}(t)$ be the number of individuals in state $i$ at $t - 2$, in $j$ at $t - 1$,
and in $k$ at $t$, and let $n_{ij}(t - 1) = \sum_k n_{ijk}(t)$. We assume in this section that
the $n_i(0)$ and $n_{ij}(1)$ are nonrandom, extending the idea of the earlier sections
where the $n_i(0)$ were nonrandom and the $n_{ij}(1)$ were random variables. The
$n_{ijk}(t)$ ($i, j, k = 1, \ldots, m$; $t = 2, \ldots, T$) is a set of sufficient statistics for
the different sequences of states. The conditional distribution of $n_{ijk}(t)$, given
$n_{ij}(t - 1)$, is

(3.10)  $\dfrac{n_{ij}(t-1)!}{\prod_{k=1}^{m} n_{ijk}(t)!} \prod_{k=1}^{m} p_{ijk}^{\,n_{ijk}(t)}.$

(When the transition probabilities need not be the same for each time interval,
the symbols $p_{ijk}$ should, of course, be replaced by the appropriate $p_{ijk}(t)$
throughout.) The joint distribution of $n_{ijk}(t)$ for $i, j, k = 1, \ldots, m$ and $t = 2, \ldots, T$,
when the set of $n_{ij}(1)$ is given, is the product of (3.10) over $i$, $j$ and $t$.

For chains with stationary transition probabilities, a stronger result concerning
sufficiency can be obtained as it was for first-order chains; namely, the
numbers $n_{ijk} = \sum_{t=2}^{T} n_{ijk}(t)$ form a set of sufficient statistics. The maximum
likelihood estimate of $p_{ijk}$ for stationary chains is

(3.11)  $\hat p_{ijk} = n_{ijk} \Big/ \sum_{l=1}^{m} n_{ijl} = \sum_{t=2}^{T} n_{ijk}(t) \Big/ \sum_{t=2}^{T} n_{ij}(t-1).$

Now let us consider testing the null hypothesis that the chain is first-order
against the alternative that it is second-order. The null hypothesis is that
$p_{1jk} = p_{2jk} = \cdots = p_{mjk} = p_{jk}$, say, for $j, k = 1, \ldots, m$. The likelihood
ratio criterion for testing this hypothesis is$^2$

(3.12)  $\lambda = \prod_{i,j,k=1}^{m} \left( \hat p_{jk} / \hat p_{ijk} \right)^{n_{ijk}},$

where

(3.13)  $\hat p_{jk} = \sum_{i=1}^{m} n_{ijk} \Big/ \sum_{i=1}^{m} \sum_{l=1}^{m} n_{ijl} = \sum_{t=2}^{T} n_{jk}(t) \Big/ \sum_{t=1}^{T-1} n_j(t)$

is the maximum likelihood estimate of $p_{jk}$. We see here that $\hat p_{jk}$ differs somewhat
from (2.8). This difference is due to the fact that in the earlier section the
$n_{ij}(1)$ were random variables while in this section we assumed that the $n_{ij}(1)$
were nonrandom. Under the null hypothesis, $-2 \log \lambda$ has an asymptotic $\chi^2$-distribution
with $m^2(m - 1) - m(m - 1) = m(m - 1)^2$ degrees of freedom.

We observe that the likelihood ratio (3.12) resembles likelihood ratios obtained
for problems relating to contingency tables. We shall now develop further
this similarity to standard procedures for contingency tables.

For a given $j$, the $n^{1/2}(\hat p_{ijk} - p_{ijk})$ have the same asymptotic distribution as
the estimates of multinomial probabilities for $m$ independent samples ($i = 1$,
$2, \ldots, m$). An $m \times m$ table, which has the same formal appearance as a
contingency table, can be used to represent the estimates $\hat p_{ijk}$ for a given $j$
and for $i, k = 1, 2, \ldots, m$. The null hypothesis is that $p_{ijk} = p_{jk}$ for $i = 1$,
$2, \ldots, m$, and the $\chi^2$-test of homogeneity seems appropriate. To test this
hypothesis, calculate

(3.14)  $\chi_j^2 = \sum_{i,k} n_{ij}^* (\hat p_{ijk} - \hat p_{jk})^2 / \hat p_{jk},$

where

(3.15)  $n_{ij}^* = \sum_{k} n_{ijk} = \sum_{k} \sum_{t=2}^{T} n_{ijk}(t) = \sum_{t=2}^{T} n_{ij}(t-1) = \sum_{t=1}^{T-1} n_{ij}(t).$

If the hypothesis is true, $\chi_j^2$ has the usual limiting distribution with $(m - 1)^2$
degrees of freedom.


2  The  criterion  (.3.12)  was  written  incorrectly  in  (6.35)  of  [1]  and  (4.10)  of  [2]. 
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111  continued  analogy  with  Section  3.2,  another  test  of  the  hypothesis  of 
homogeneity  for  m  independent  samples  from  multinomial  trials  can  be  ob- 
tained by  use  of  the  likelihood  ratio  criterion.  We  calculate 

(3.16)  X,  =  n  (pjk  I  ViiLT''\ 

which  is  formally  similar  to  the  likelihood  ratio  criterion.  The  asymptotic 
distribution  of  —2  log  X>  is  %  with  (m  —  1)^  degrees  of  freedom. 

The  preceding  remarks  relating  to  the  contingency  table  approach  dealt  with 
a  given  value  of  j.  Hence,  the  hypothesis  can  be  tested  separately  for  each 
value  of  j. 

Let us now consider the joint hypothesis that $p_{ijk} = p_{jk}$ for all $i, j, k = 1, 2, \cdots, m$.
A test of this joint hypothesis can be obtained by computing the sum

(3.17)  $\chi^2 = \sum_{j=1}^{m} \chi_j^2 = \sum_{j,i,k} n^*_{ij} (\hat p_{ijk} - \hat p_{jk})^2 / \hat p_{jk}$,

which has the usual limiting distribution with $m(m-1)^2$ degrees of freedom.
Similarly the test criterion based on (3.12) can be written

(3.18)  $\sum_{j=1}^{m} -2 \log \lambda_j = -2 \log \lambda = 2 \sum_{i,j,k} n_{ijk} \log [\hat p_{ijk} / \hat p_{jk}]$
        $= 2 \sum_{i,j,k} n_{ijk} [\log \hat p_{ijk} - \log \hat p_{jk}]$.
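As a rough illustration of (3.17) and (3.18), the following Python sketch computes both statistics from a table of pooled triple counts. The function name `second_order_tests` and the nested-list layout `n[i][j][k]` (state i at time t-2, j at t-1, k at t) are assumptions made for this example and are not notation from the paper. The pooled estimate of $p_{jk}$ is taken as $\sum_i n_{ijk} / \sum_i n^*_{ij}$, the $r = 2$ case of (3.22), and empty cells are simply skipped, which is one common convention rather than anything the text prescribes.

```python
import math


def second_order_tests(n):
    """Return (chi2, minus_two_log_lambda, df) for first- vs second-order chains.

    n[i][j][k] is the pooled count of individuals in state i at t-2,
    state j at t-1 and state k at t (hypothetical layout for this sketch).
    """
    m = len(n)
    chi2 = 0.0
    minus_two_log_lambda = 0.0
    for j in range(m):
        # n*_{ij}: row totals of the m x m table for this j
        n_star = [sum(n[i][j]) for i in range(m)]
        # column totals over i, used for the pooled estimate of p_{jk}
        col = [sum(n[i][j][k] for i in range(m)) for k in range(m)]
        total = sum(n_star)
        if total == 0:
            continue
        for k in range(m):
            p_jk = col[k] / total
            if p_jk == 0:
                continue
            for i in range(m):
                if n_star[i] == 0:
                    continue
                p_ijk = n[i][j][k] / n_star[i]
                chi2 += n_star[i] * (p_ijk - p_jk) ** 2 / p_jk
                if n[i][j][k] > 0:
                    minus_two_log_lambda += 2 * n[i][j][k] * math.log(p_ijk / p_jk)
    df = m * (m - 1) ** 2
    return chi2, minus_two_log_lambda, df
```

For the made-up table `[[[8, 2], [5, 5]], [[2, 8], [5, 5]]]` the two statistics come out near 7.2 and 7.71, each to be referred to a $\chi^2$ distribution with 2 degrees of freedom.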

The preceding remarks can be directly generalized for a chain of order $r$.
Let $p_{ij\cdots kl}$ ($i, j, \cdots, k, l = 1, 2, \cdots, m$) denote the transition probability of
state $l$ at time $t$, given state $k$ at time $t-1$, $\cdots$, state $j$ at time $t-r+1$,
and state $i$ at time $t-r$ ($t = r, r+1, \cdots, T$). We shall test the null hypothesis
that the process is a chain of order $r-1$ (that is, $p_{ij\cdots kl} = p_{j\cdots kl}$ for
$i = 1, 2, \cdots, m$) against the alternate hypothesis that it is not an $(r-1)$- but an
$r$-order chain.

Let $n_{ij\cdots kl}(t)$ denote the observed frequency of the states $i, j, \cdots, k, l$ at
the respective times $t-r, t-r+1, \cdots, t-1, t$, and let
$n_{ij\cdots k}(t-1) = \sum_{l=1}^{m} n_{ij\cdots kl}(t)$. We assume here that the
$n_{ij\cdots k}(r-1)$ are nonrandom. The maximum likelihood estimate of $p_{ij\cdots kl}$ is

(3.19)  $\hat p_{ij\cdots kl} = n_{ij\cdots kl} / n^*_{ij\cdots k}$,

where $n_{ij\cdots kl} = \sum_{t=r}^{T} n_{ij\cdots kl}(t)$ and

(3.20)  $n^*_{ij\cdots k} = \sum_{l=1}^{m} n_{ij\cdots kl} = \sum_{t=r}^{T} n_{ij\cdots k}(t-1) = \sum_{t=r-1}^{T-1} n_{ij\cdots k}(t)$.


For a given set $j, \cdots, k$, the set $\hat p_{ij\cdots kl}$ will have the same asymptotic
distribution as estimates of multinomial probabilities for $m$ independent samples
($i = 1, 2, \cdots, m$), and may be represented by an $m \times m$ table. If the null hypothesis
($p_{ij\cdots kl} = p_{j\cdots kl}$ for $i = 1, 2, \cdots, m$) is true, then the $\chi^2$-test of
homogeneity seems appropriate, and

(3.21)  $\chi^2_{j\cdots k} = \sum_{i,l} n^*_{ij\cdots k} (\hat p_{ij\cdots kl} - \hat p_{j\cdots kl})^2 / \hat p_{j\cdots kl}$,

where

(3.22)  $\hat p_{j\cdots kl} = \sum_i n_{ij\cdots kl} \Big/ \sum_i n^*_{ij\cdots k} = \sum_{t=r}^{T} n_{j\cdots kl}(t) \Big/ \sum_{t=r-1}^{T-1} n_{j\cdots k}(t)$,

has the usual limiting distribution with $(m-1)^2$ degrees of freedom. We see
here that $\hat p_{j\cdots kl}$ differs somewhat from the maximum likelihood estimate for
$p_{j\cdots kl}$ for an $(r-1)$-order chain (viz., $\sum_{t=r-1}^{T} n_{j\cdots kl}(t) / \sum_{t=r-2}^{T-1} n_{j\cdots k}(t)$).
This difference is due to the fact that the $n_{j\cdots kl}(r-1)$, for an $(r-1)$-order chain,
are assumed to be multinomial random variables with parameters $p_{j\cdots kl}$, while
in this paragraph we have assumed that the $n_{j\cdots kl}(r-1)$ are fixed.

Since there are $m^{r-1}$ sets $j, \cdots, k$ ($j = 1, 2, \cdots, m$; $\cdots$; $k = 1, 2, \cdots, m$),
the sum $\sum_{j,\cdots,k} \chi^2_{j\cdots k}$ will have the usual limiting distribution with
$m^{r-1}(m-1)^2$ degrees of freedom when the joint null hypothesis
($p_{ij\cdots kl} = p_{j\cdots kl}$ for $i = 1, 2, \cdots, m$ and all values from 1 to $m$ of
$j, \cdots, k$) is true.

Another test of the null hypothesis can be obtained by use of the likelihood
ratio criterion

(3.23)  $\lambda_{j\cdots k} = \prod_{i,l} (\hat p_{j\cdots kl} / \hat p_{ij\cdots kl})^{n_{ij\cdots kl}}$,

where $-2 \log \lambda_{j\cdots k}$ is distributed asymptotically as $\chi^2$ with $(m-1)^2$ degrees
of freedom. Also,

(3.24)  $\sum_{j,\cdots,k} \{-2 \log \lambda_{j\cdots k}\} = 2 \sum_{i,j,\cdots,k,l} n_{ij\cdots kl} \log (\hat p_{ij\cdots kl} / \hat p_{j\cdots kl})$

has a limiting $\chi^2$-distribution with $m^{r-1}(m-1)^2$ degrees of freedom when the
joint null hypothesis is true (see [10]).

In the special case where $r = 1$, the test is of the null hypothesis that
observations at successive time points are statistically independent against the
alternate hypothesis that observations are from a first-order chain.

The reader will note that the method used to test the null hypothesis that
the process is a chain of order $r-1$ against the alternate hypothesis that it
is of order $r$ can be generalized to test the null hypothesis that the process is of
order $u$ against the alternate hypothesis that it is of order $r$ ($u < r$). By an
approach similar to that presented earlier in this section, we can compute the
$\chi^2$-criterion or $-2$ times the logarithm of the likelihood ratio and observe that
these statistics are distributed asymptotically as $\chi^2$ with $[m^r - m^u](m-1)$
degrees of freedom when the null hypothesis is true.
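The degrees-of-freedom rule just stated is easy to tabulate. The helper below is only a convenience sketch; the name `order_test_df` is hypothetical, and the function merely evaluates $[m^r - m^u](m-1)$ for $m$ states and orders $u < r$.

```python
def order_test_df(m, u, r):
    """Degrees of freedom for testing order u against order r (u < r), m states."""
    if m < 2:
        raise ValueError("need at least two states")
    if not 0 <= u < r:
        raise ValueError("orders must satisfy 0 <= u < r")
    # an order-r chain has m**r rows of m - 1 free transition probabilities
    return (m ** r - m ** u) * (m - 1)
```

With $u = r - 1$ this reduces to $m^{r-1}(m-1)^2$, the count given for (3.24), and with $u = 0$, $r = 1$ it gives the $(m-1)^2$ of the test of independence.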

In this section, we have assumed that the transition probabilities are the
same for each time interval, that is, stationary. It is possible to test the null
hypothesis that the $r$th order chain has stationary transition probabilities
using methods that are straightforward generalizations of the tests presented
in the previous section for the special case of a first-order chain.

3.4. Test of the hypothesis that several samples are from the same Markov
chain of a given order. The general approach presented in the previous sections
can be used to test the null hypothesis that $s$ ($s \geq 2$) samples are from the same
$r$th order Markov chain; that is, that the $s$ processes are identical.

Let $\hat p^{(h)}_{ij\cdots kl} = n^{(h)}_{ij\cdots kl} / n^{*(h)}_{ij\cdots k}$ denote the maximum likelihood estimate of the
$r$th order transition probability $p^{(h)}_{ij\cdots kl}$ for the process from which sample $h$
($h = 1, 2, \cdots, s$) was obtained. We wish to test the null hypothesis that
$p^{(h)}_{ij\cdots kl} = p_{ij\cdots kl}$ for $h = 1, 2, \cdots, s$. Using the approach presented herein, it
follows that

(3.25)  $\chi^2_{ij\cdots k} = \sum_{h,l} n^{*(h)}_{ij\cdots k} (\hat p^{(h)}_{ij\cdots kl} - \hat p^{(\cdot)}_{ij\cdots kl})^2 / \hat p^{(\cdot)}_{ij\cdots kl}$,

where $n^{(\cdot)}_{ij\cdots kl} = \sum_h n^{(h)}_{ij\cdots kl}$ and $\hat p^{(\cdot)}_{ij\cdots kl} = n^{(\cdot)}_{ij\cdots kl} / \sum_{v=1}^{m} n^{(\cdot)}_{ij\cdots kv}$, has the usual
limiting distribution with $(s-1)(m-1)$ degrees of freedom. Also,
$\sum_{i,j,\cdots,k} \chi^2_{ij\cdots k}$ has a limiting $\chi^2$-distribution with $m^r(s-1)(m-1)$ degrees of
freedom. When $s = 2$, $\chi^2_{ij\cdots k}$ can be rewritten in the form

(3.26)  $\chi^2_{ij\cdots k} = \sum_l c^{-1}_{ij\cdots k} (\hat p^{(1)}_{ij\cdots kl} - \hat p^{(2)}_{ij\cdots kl})^2 / \hat p^{(\cdot)}_{ij\cdots kl}$,

where $\hat p^{(\cdot)}_{ij\cdots kl}$ is the estimate of $p_{ij\cdots kl}$ obtained by pooling the data in the two
samples, and $c_{ij\cdots k} = (1/n^{*(1)}_{ij\cdots k}) + (1/n^{*(2)}_{ij\cdots k})$. Also, $\sum_{i,j,\cdots,k} \chi^2_{ij\cdots k}$ has the
usual limiting distribution with $m^r(m-1)$ degrees of freedom in the two sample
case.
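The agreement between the general form (3.25) and the two-sample form (3.26) can be checked numerically. The sketch below does so for a single fixed history $(i, j, \cdots, k)$; the function names and the plain-list inputs (counts of the next state $l$ in each sample) are assumptions of this example only.

```python
def chi2_two_samples_general(counts1, counts2):
    """(3.25) with s = 2: sum over both samples of n* (p_hat - pooled)^2 / pooled."""
    n1, n2 = sum(counts1), sum(counts2)
    total = 0.0
    for a, b in zip(counts1, counts2):
        pooled = (a + b) / (n1 + n2)
        if pooled == 0:
            continue  # a next state never observed contributes nothing
        total += n1 * (a / n1 - pooled) ** 2 / pooled
        total += n2 * (b / n2 - pooled) ** 2 / pooled
    return total


def chi2_two_samples_short(counts1, counts2):
    """(3.26): squared difference of the two estimates, weighted by 1 / c."""
    n1, n2 = sum(counts1), sum(counts2)
    c = 1.0 / n1 + 1.0 / n2
    total = 0.0
    for a, b in zip(counts1, counts2):
        pooled = (a + b) / (n1 + n2)
        if pooled == 0:
            continue
        total += (a / n1 - b / n2) ** 2 / (c * pooled)
    return total
```

Both functions return the same value up to rounding, because $n_1(\hat p^{(1)} - \hat p^{(\cdot)})^2 + n_2(\hat p^{(2)} - \hat p^{(\cdot)})^2 = (\hat p^{(1)} - \hat p^{(2)})^2 \, n_1 n_2/(n_1 + n_2)$ when the pooled estimate is the count-weighted average of the two.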

Analogous  results  can  also  be  obtained  using  the  likelihood-ratio  criterion. 

3.5. A test involving two sets of states. In the case of panel studies, a person
is usually asked several questions. We might classify each individual according
to his opinion on two different questions. In an example in [2], one classification
indicated whether a person saw the advertisement of a certain product and the
other whether he bought the product in a certain time interval. Let the state
be denoted $(\alpha, \beta)$, $\alpha = 1, \cdots, A$ and $\beta = 1, \cdots, B$, where $\alpha$ denotes the first
opinion or class and $\beta$ the second. We assume that the sequence of states satisfies
a first-order Markov chain with transition probabilities $p_{\alpha\beta,\mu\nu}$. We ask whether
the sequence of changes in one classification is independent of that in the second.
For example, if a person notices an advertisement, is he more likely to buy the
product? The null hypothesis of independence of changes is

(3.27)  $p_{\alpha\beta,\mu\nu} = q_{\alpha\mu} r_{\beta\nu}$,   $\alpha, \mu = 1, \cdots, A$;  $\beta, \nu = 1, \cdots, B$,

where $q_{\alpha\mu}$ is a transition probability for the first classification and $r_{\beta\nu}$ is for the
second. We shall find the likelihood ratio criterion for testing this null hypothesis.
Let $n_{\alpha\beta,\mu\nu}(t)$ be the number of individuals in state $(\alpha, \beta)$ at $t-1$ and $(\mu, \nu)$
at $t$. From the previous results, the maximum likelihood estimate of $p_{\alpha\beta,\mu\nu}$,
when the null hypothesis is not assumed, is

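(3.28)  $\hat p_{\alpha\beta,\mu\nu} = n_{\alpha\beta,\mu\nu} \Big/ \sum_{\mu=1}^{A} \sum_{\nu=1}^{B} n_{\alpha\beta,\mu\nu}$,

where $n_{\alpha\beta,\mu\nu} = \sum_{t=1}^{T} n_{\alpha\beta,\mu\nu}(t)$. When the null hypothesis is assumed, the
maximum likelihood estimate of $p_{\alpha\beta,\mu\nu}$ is $\hat q_{\alpha\mu} \hat r_{\beta\nu}$, where

(3.29)  $\hat q_{\alpha\mu} = \sum_{\beta,\nu=1}^{B} n_{\alpha\beta,\mu\nu} \Big/ \sum_{\beta,\nu=1}^{B} \sum_{\mu=1}^{A} n_{\alpha\beta,\mu\nu}$,

(3.30)  $\hat r_{\beta\nu} = \sum_{\alpha,\mu=1}^{A} n_{\alpha\beta,\mu\nu} \Big/ \sum_{\alpha,\mu=1}^{A} \sum_{\nu=1}^{B} n_{\alpha\beta,\mu\nu}$.

The likelihood ratio criterion is

(3.31)  $\lambda = \prod_{\alpha,\mu=1}^{A} \prod_{\beta,\nu=1}^{B} (\hat q_{\alpha\mu} \hat r_{\beta\nu} / \hat p_{\alpha\beta,\mu\nu})^{n_{\alpha\beta,\mu\nu}}$.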

Under the null hypothesis, $-2 \log \lambda$ has an asymptotic $\chi^2$-distribution, and
the number of degrees of freedom is
$AB(AB-1) - A(A-1) - B(B-1) = (A-1)(B-1)(AB+A+B)$.
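The degrees-of-freedom count can be checked by counting free parameters under each hypothesis. The short derivation below is a worked restatement added for convenience, not part of the original argument.

```latex
\begin{aligned}
\text{unrestricted chain:}\quad & AB \text{ rows, each with } AB - 1 \text{ free probabilities}
   \;\Rightarrow\; AB(AB-1), \\
\text{null hypothesis (3.27):}\quad & A(A-1) \text{ for the } q_{\alpha\mu}
   \text{ plus } B(B-1) \text{ for the } r_{\beta\nu}, \\
\text{difference:}\quad & A^2B^2 - AB - A^2 + A - B^2 + B \\
   & = (AB - A - B + 1)(AB + A + B) \\
   & = (A-1)(B-1)(AB + A + B).
\end{aligned}
```

For instance, with $A = B = 2$ both sides equal 8.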

4. A modified model. In the preceding sections, we assumed that the $n_i(0)$
were nonrandom. An alternative is that the $n_i(0)$ are distributed multinomially
with probability $\eta_i$ and sample size $n$. Then the distribution of the set $n_{ij}(t)$
is (2.5) multiplied by the marginal distribution of the set $n_i(0)$, which is

(4.1)  $\dfrac{n!}{\prod_{i=1}^{m} n_i(0)!} \prod_{i=1}^{m} \eta_i^{n_i(0)}$.

In this model, the maximum likelihood estimate of $p_{ij}$ is again (2.8), and the
maximum likelihood estimate of $\eta_i$ is

(4.2)  $\hat \eta_i = n_i(0) / n$.

The means, variances, and covariances of $n_{ij}(t) - n_i(t-1)p_{ij}$ are found by
taking the expected values of (2.20) to (2.23); the same formulas apply with
$n_k(0)$ replaced by $n\eta_k$. Also $n_{ij}(t) - n_i(t-1)p_{ij}$ are uncorrelated with $n_k(0)$.
Since $n_k(0)/n$ estimates $\eta_k$ consistently, the asymptotic variances and covariances
of $n^{1/2}(\hat p_{ij} - p_{ij})$ are as in Section 2.4. It follows from these facts that the
asymptotic theory of the tests given in Section 3 holds for this modified model.

The asymptotic variances and covariances simplify somewhat if the chain
starts from a stationary state; that is, if

(4.3)  $\sum_{k=1}^{m} \eta_k p_{ki} = \eta_i$.

For then $\sum_k \eta_k p^{(t)}_{ki} = \eta_i$ and $\phi_i = T\eta_i$. If it is known that the chain starts
from a stationary state, equations (4.3) should be of some additional use in the
estimation of $p_{ij}$ when knowledge of the $\eta_i$, or even estimates of the $\eta_i$, are
available. We have dealt in this paper with the more general case where it is
not known whether (4.3) holds, and have used the maximum likelihood estimates
for this case. The estimates obtained for the more general case are not
efficient in the special case of a chain in a stationary state because relevant
information is ignored. In the special case, the maximum likelihood estimates
for the $\eta_i$ and $p_{ij}$ are obtained by maximizing
$\log L = \sum n_{ij} \log p_{ij} + \sum n_i(0) \log \eta_i$ subject to the restrictions
$\sum_j p_{ij} = 1$, $\sum_i \eta_i p_{ij} = \eta_j$, $\sum_j \eta_j = 1$, $p_{ij} \geq 0$, $\eta_i \geq 0$.
In the case of a chain in a stationary state where the $\eta_i$ are known,
the maximum likelihood estimates for the $p_{ij}$ are obtained by maximizing
$\sum n_{ij} \log p_{ij}$ subject to the restrictions $\sum_j p_{ij} = 1$, $\sum_i \eta_i p_{ij} = \eta_j$, $p_{ij} \geq 0$.
Lagrange multipliers can be used to obtain the equations for the maximum
likelihood estimates.

5. One observation on a chain of great length. In the previous sections,
asymptotic results were presented for $n_i(0) \to \infty$, and hence
$\sum_{i=1}^{m} n_i(0) = n \to \infty$, while $T$ was fixed. The case of one observed sequence of
states ($n = 1$) has been studied by Bartlett [3] and Hoel [10], and they consider the
asymptotic theory when the number of times of observation increases ($T \to \infty$).
Bartlett has shown that the number $n_{ij}$ of times that the observed sequence was in
state $i$ at time $t-1$ and in state $j$ at time $t$, for $t = 1, \cdots, T$, is asymptotically
normally distributed in the 'positively regular' situation (see [3], p. 91). He also
has shown ([3], p. 93) that the maximum likelihood estimates $\hat p_{ij} = n_{ij}/n^*_i$
($n^*_i = \sum_{j=1}^{m} n_{ij}$) have asymptotic variances and covariances given by the usual
multinomial formulas appropriate to $\mathcal{E} n^*_i$ independent observations ($i = 1,
2, \cdots, m$) from multinomial probabilities $p_{ij}$ ($j = 1, 2, \cdots, m$), and that the
asymptotic covariances for two different values of $i$ are 0. An argument like
that of Section 2.4 shows that the variables $(n^*_i)^{1/2}(\hat p_{ij} - p_{ij})$ have a limiting
normal distribution with means 0 and the variances and covariances given in
Section 2.4. This result was proved in a different way by L. A. Gardner [8].

Thus we see that the asymptotic theory for $T \to \infty$ and $n = 1$ is essentially
the same as for $T$ fixed and $n_i(0) \to \infty$. Hence, the same test procedures are
valid except for such tests as on possibly nonstationary chains. For example,
Hoel's likelihood ratio criterion [10] to test the null hypothesis that the order
of the chain is $r-1$ against the alternate hypothesis that it is $r$ is parallel to
the likelihood ratio criterion for this test given in Section 3.3. The $\chi^2$-test for
this hypothesis, and the generalizations of the tests to the case where the null
hypothesis is that the process is of order $u$ and the alternate hypothesis is that
the process is of order $r$ ($u < r$), which are presented in Section 3.3, are also
applicable for large $T$. Also, the $\chi^2$-test presented in Section 3.1 can be generalized
to provide an alternative to Bartlett's likelihood ratio criterion [3] for testing
the null hypothesis that $p_{ij\cdots kl} = p^0_{ij\cdots kl}$ (specified).

6. $\chi^2$-tests and likelihood ratio criteria. The $\chi^2$-tests presented in this paper
are asymptotically equivalent, in a certain sense, to the corresponding likelihood
ratio tests, as will be proved in this section. This fact does not seem to follow
from the general theory of $\chi^2$-tests; the $\chi^2$-tests presented herein are different
from those $\chi^2$-tests that can be obtained directly by considering the number of
individuals in each of the $m^T$ possible mutually exclusive sequences (see Section
2.1) as the multinomial variables of interest. The $\chi^2$-tests based on $m^T$ categories
need not consider the data as having been obtained from a Markov chain and
the alternate hypothesis may be extremely general, while the $\chi^2$-tests presented
herein are based on a Markov chain model.

For small samples, not enough data has been accumulated to decide which
tests are to be preferred (see comments in [5]). The relative rate of approach
to the asymptotic distributions and the relative power of the tests for small
samples are not known. In this section, a method somewhat related to the
relative power will be tentatively suggested for deciding which tests are to be
preferred when the sample size is moderately large and there is a specific alternate
hypothesis. An advantage of the $\chi^2$-tests, which are of the form used in
contingency tables, is that, for many users of these methods, their motivation
and their application seem to be simpler.

We shall now prove that the likelihood ratio and the $\chi^2$-tests (tests of
homogeneity) presented in Section 3.2 are asymptotically equivalent in a certain
sense. First, we shall show that the $\chi^2$-statistic has an asymptotic $\chi^2$-distribution
under the null hypothesis. The method of proof can be used whenever the
relevant $\hat p$'s have the appropriate limiting normal distribution. In particular,
this will be true for statistics of the form $\chi_i^2$ (see (3.6)). In order to prove that
statistics of the form $\lambda_i$ (see (3.7)), which are formally similar to the likelihood
ratio criterion but are not actually likelihood ratios, have the appropriate
asymptotic distribution, we shall then show that $-2 \log \lambda_i$ is asymptotically
equivalent to the $\chi_i^2$-statistic, and therefore it has an asymptotic $\chi^2$-distribution
under the null hypothesis. Then we shall discuss the question of the equivalence
of the tests under the alternate hypothesis. The method of proof presented
here can be applied to the appropriate statistics given in the other sections
herein, and also where $T \to \infty$ as well as where $n \to \infty$.

Let us consider the distribution of the $\chi^2$-statistic (3.8) under the null
hypothesis. From Section 2.4, we see that $n^{1/2}(\hat p_{ij}(t) - p_{ij})$ are asymptotically
normally distributed with means 0 and variances $p_{ij}(1 - p_{ij})/m_i(t-1)$, etc.,
where $m_i(t) = \mathcal{E} n_i(t)/n$. For different $t$ or different $i$, they are asymptotically
independent. Then the $[n m_i(t-1)]^{1/2} [\hat p_{ij}(t) - p_{ij}]$ have asymptotically
variances $p_{ij}(1 - p_{ij})$, etc. Let $\hat p^*_{ij} = \sum_t m_i(t-1) \hat p_{ij}(t) / \sum_t m_i(t-1)$. Then
by the usual $\chi^2$-theory, $\sum n m_i(t-1)[\hat p_{ij}(t) - \hat p^*_{ij}]^2 / \hat p^*_{ij}$ has an asymptotic
$\chi^2$-distribution under the null hypothesis. But

(6.1)  $\mathrm{p\,lim}\, (\hat p^*_{ij} - \hat p_{ij}) = 0$

because

(6.2)  $\mathrm{p\,lim}\, (n_i(t)/n - m_i(t)) = 0$.

From the convergence in probability of $(\hat p^*_{ij} - \hat p_{ij})$ and $(m_i(t) - n_i(t)/n)$,
and the fact that $n^{1/2}(\hat p_{ij}(t) - p_{ij})$ has a limiting distribution, it follows that

(6.3)  $\mathrm{p\,lim} \left[ n \sum \dfrac{m_i(t-1)[\hat p_{ij}(t) - \hat p^*_{ij}]^2}{\hat p^*_{ij}} - \sum \dfrac{n_i(t-1)[\hat p_{ij}(t) - \hat p_{ij}]^2}{\hat p_{ij}} \right] = 0$.

Hence, the $\chi^2$-statistic has the same asymptotic distribution as
$\sum n m_i(t-1)[\hat p_{ij}(t) - \hat p^*_{ij}]^2 / \hat p^*_{ij}$; that is, a $\chi^2$-distribution. This proof also
indicates that the $\chi_i^2$-statistics (3.6) also have a limiting $\chi^2$-distribution. We shall
now show that $-2 \log \lambda_i$ (see (3.7)) is asymptotically equivalent to $\chi_i^2$ under the
null hypothesis, and hence will also have a limiting $\chi^2$-distribution.
We first note that for $|x| < \tfrac{1}{2}$

(6.4)  $(1 + x) \log (1 + x) = (1 + x)(x - x^2/2 + x^3/3 - x^4/4 + \cdots)$
       $= x + x^2/2 - (x^3/6)(1 - x/2 + \cdots)$,

and

(6.5)  $|(1 + x) \log (1 + x) - x - x^2/2| = |(x^3/6)(1 - x/2 + \cdots)| \leq |x|^3$

(see p. 217 in [6]). We see also that

(6.6)  $-2 \log \lambda_i = -2 \sum_{j,t} n_{ij}(t) \log [\hat p_{ij} / \hat p_{ij}(t)]$
       $= 2 \sum_{j,t} n_i(t-1) \hat p_{ij}(t) \log [\hat p_{ij}(t) / \hat p_{ij}]$
       $= 2 \sum_{j,t} n_i(t-1) \hat p_{ij} [1 + x_{ij}(t)] \log [1 + x_{ij}(t)]$,

where $x_{ij}(t) = [\hat p_{ij}(t) - \hat p_{ij}] / \hat p_{ij}$. The difference $\Delta$ between $-2 \log \lambda_i$ and the
$\chi_i^2$-statistic is

(6.7)  $\Delta = -2 \log \lambda_i - \chi_i^2$
       $= 2 \sum_{j,t} n_i(t-1) \hat p_{ij} \{[1 + x_{ij}(t)] \log [1 + x_{ij}(t)] - [x_{ij}(t)]^2/2\}$.

Since $\sum_{j=1}^{m} \hat p_{ij} x_{ij}(t) = 0$,

(6.8)  $\Delta = 2 \sum_{j,t} n_i(t-1) \hat p_{ij} \{[1 + x_{ij}(t)] \log [1 + x_{ij}(t)] - x_{ij}(t) - [x_{ij}(t)]^2/2\}$.

We shall show that $\Delta$ converges to 0 in probability; i.e., for any $\epsilon > 0$, the
probability of the relation $|\Delta| < \epsilon$, under the null hypothesis, tends to unity as
$n = \sum_i n_i(t) \to \infty$. The probability satisfies the relation

(6.9)  $\Pr\{|\Delta| < \epsilon\} \geq \Pr\{|\Delta| < \epsilon \text{ and } |x_{ij}(t)| < \tfrac{1}{2}\}$
       $\geq \Pr\{|2 \sum_{j,t} n_i(t-1) \hat p_{ij} [x_{ij}(t)]^3| < \epsilon \text{ and } |x_{ij}(t)| < \tfrac{1}{2}\}$
       $\geq \Pr\{2n \sum_{j,t} |x_{ij}(t)|^3 < \epsilon \text{ and } |x_{ij}(t)| < \tfrac{1}{2}\}$.

It is therefore necessary only to prove that $n[x_{ij}(t)]^3$ converges to 0 in
probability. Since $x_{ij}(t) = [\hat p_{ij}(t) - \hat p_{ij}]/\hat p_{ij}$ converges to zero in probability under
the null hypothesis, and

(6.10)  $n^{1/2} x_{ij}(t) = n^{1/2} \left\{ \left[ \dfrac{\hat p_{ij}(t) - p_{ij}}{\hat p_{ij}} \right] - \left[ \dfrac{\hat p_{ij} - p_{ij}}{\hat p_{ij}} \right] \right\}$,

in which each bracketed term multiplied by $n^{1/2}$ has a limiting distribution,
it follows that

(6.11)  $n[x_{ij}(t)]^3 = [x_{ij}(t)\, n^{1/2}]^2 x_{ij}(t)$

converges to zero in probability when the null hypothesis is true. Q.E.D.

Since the $\chi_i^2$-statistic has a limiting $\chi^2$-distribution under the null hypothesis,
and $\Delta = -2 \log \lambda_i - \chi_i^2$ converges in probability to zero, $-2 \log \lambda_i = \chi_i^2 + \Delta$
has a limiting $\chi^2$-distribution under the null hypothesis.
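The proof above rests on the elementary bound (6.5). As an informal numerical check, and not a substitute for the analytic argument cited from [6], the sketch below evaluates the remainder on a grid of points in $|x| < 1/2$; the function names are hypothetical.

```python
import math


def taylor_remainder(x):
    """|(1 + x) log(1 + x) - x - x^2 / 2|, the left side of (6.5)."""
    return abs((1.0 + x) * math.log(1.0 + x) - x - x * x / 2.0)


def bound_holds(x):
    """True when the remainder does not exceed |x|^3 at the point x."""
    return taylor_remainder(x) <= abs(x) ** 3


# grid of 999 points strictly inside (-1/2, 1/2)
all_ok = all(bound_holds(k / 1000.0) for k in range(-499, 500))
```

Near zero the remainder behaves like $|x|^3/6$, so the bound $|x|^3$ is comfortable rather than sharp, which is all the convergence argument for $\Delta$ requires.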

The method presented herein for showing the asymptotic equivalence of
$-2 \log \lambda_i$ and $\chi_i^2$ could also be used to show the asymptotic equivalence of
statistics of the form $-2 \log \lambda$ and $\chi^2$. It was proved in Section 3.2 that, under
the null hypothesis, $-2 \log \lambda$ has a limiting $\chi^2$-distribution with $m(m-1)(T-1)$
degrees of freedom. (The proof in Section 3.2 applied to $\lambda$, a likelihood
ratio criterion, but would not apply to $\lambda_i$ since they are not actually likelihood
ratios.) Hence, we have another proof that the $\chi^2$-statistic has the same limiting
distribution as the likelihood ratio criterion under the null hypothesis.

The previous remarks refer to the case where the null hypothesis is true.
Now suppose the alternate hypothesis is true; that is, $p_{ij}(t) \neq p_{ij}(s)$ for some
$t, s, i, j$. It is easy to see that both the $\chi^2$-test and the likelihood ratio test are
consistent under any alternate hypothesis. In other words, if the values of $p_{ij}(t)$
for the alternate hypothesis and the significance level are kept fixed, then as $n$
increases, the power of each test tends to 1 (see [5] and [11]).

In order to examine the situation in which the power is not close to 1 in large
samples and also to make comparisons between tests, the alternate hypothesis
may be moved closer to the null hypothesis as $n$ increases. If the values of
$p_{ij}(t)$ for the alternate hypothesis are not fixed but move closer to the null
hypothesis, it can be seen that the two tests are again asymptotically equivalent.
This can be deduced by a slight modification of the proof of asymptotic
equivalence under the null hypothesis given in this section (see also [5], p. 323).

We shall now suggest another approach to the comparison of these tests when
the alternate hypothesis is kept fixed. Since the null hypothesis is rejected
when an appropriate statistic ($\chi^2$ or $-2 \log \lambda$) exceeds a specified critical value,
we might decide that the $\chi^2$-test is to be preferred to the likelihood ratio test
if the statistic $\chi^2$ is in some sense (stochastically) larger than $-2 \log \lambda$ under
the alternate hypothesis.

Since $n_i(t)$ is a linear combination of multinomial variables, we see that
$n_i(t)/n$ converges in probability to its expected value $\mathcal{E}[n_i(t)/n] = m_i(t)$. Hence,
$\chi^2/n$ converges in probability to

(6.12)  $\sum_{i,j,t} m_i(t-1) [p_{ij}(t) - \bar p_{ij}]^2 / \bar p_{ij}$,

and $(-2 \log \lambda)/n$ converges in probability to

(6.13)  $2 \sum_{i,j,t} m_i(t-1) p_{ij}(t) \log [p_{ij}(t) / \bar p_{ij}]$,

where

(6.14)  $\bar p_{ij} = \sum_t p_{ij}(t) m_i(t-1) \Big/ \sum_t m_i(t-1) = \mathrm{p\,lim}_{n \to \infty}\, \hat p_{ij}$.

The difference between (6.12) and (6.13) is approximately

(6.15)  $\sum_{i,j,t} m_i(t-1) [p_{ij}(t) - \bar p_{ij}]^3 / (3 \bar p_{ij}^2)$.
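The stochastic limits (6.12) and (6.13) and the cubic approximation (6.15) can be evaluated directly for any fully specified alternative. The sketch below is one way to do so; the function name `stochastic_limits`, the argument `eta` (taken here to be the initial distribution $m_i(0)$) and the list `P` of transition matrices for $t = 1, \cdots, T$ are assumptions of this example. The $m_i(t)$ are obtained by propagating `eta` through the matrices, and all $m_i(t-1)$ and limiting probabilities are assumed positive.

```python
import math


def stochastic_limits(eta, P):
    """Return ((6.12), (6.13), (6.15)) for initial distribution eta and matrices P[t-1]."""
    m = len(eta)
    # dists[t-1][i] = m_i(t-1): state distribution just before transition t
    dists = [list(eta)]
    for Pt in P[:-1]:
        prev = dists[-1]
        dists.append([sum(prev[i] * Pt[i][j] for i in range(m)) for j in range(m)])
    chi2_lim = lr_lim = cubic = 0.0
    for i in range(m):
        weight = sum(d[i] for d in dists)
        for j in range(m):
            # (6.14): weighted average of p_ij(t) over t
            p_bar = sum(d[i] * Pt[i][j] for d, Pt in zip(dists, P)) / weight
            for d, Pt in zip(dists, P):
                diff = Pt[i][j] - p_bar
                chi2_lim += d[i] * diff ** 2 / p_bar
                if Pt[i][j] > 0:
                    lr_lim += 2 * d[i] * Pt[i][j] * math.log(Pt[i][j] / p_bar)
                cubic += d[i] * diff ** 3 / (3 * p_bar ** 2)
    return chi2_lim, lr_lim, cubic
```

When the matrices in `P` are all equal, every term vanishes and the three values are zero. For small departures from stationarity the first two values nearly coincide, and the third gives the leading term of their difference.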

Under the alternate hypothesis, these two stochastic limits differ from 0,
and computation of them suggests which test is better. If $(p_{ij}(t) - \bar p_{ij})/\bar p_{ij}$ is
small, then there will be only a small difference between the two limits. When
the alternative is some composite hypothesis, as is usually the case when
$\chi^2$-tests are applied, then these stochastic limits can be computed and compared
for the simple alternatives that are included in the alternate hypothesis.

This method for comparing tests is somewhat related to Cochran's comment
(see p. 323 in [5]) that either (a) the significance probability can be made to
decrease as $n$ increases, thus reducing the chance of an error of type I, or (b)
the alternate hypothesis can be moved steadily closer to the null hypothesis.
Method (b) was discussed in [3]. If method (a) is used, then the critical value
of the statistic ($\chi^2$ or $-2 \log \lambda$) will increase as $n$ increases. When the critical
value has the form $cn$, where $c$ is a constant (there may be some question as
to whether this form for the critical value is really suitable), we see from the
remarks in the preceding paragraph that the power of a test will tend to 1 if
$c$ is less than the stochastic limit and it will tend to 0 if $c$ is greater than the
stochastic limit. Hence, by this approach we find that the power of the $\chi^2$-test
can be quite different from the power of the likelihood ratio test, and some
approximate computations can suggest which test is to be preferred.

However, a more appealing approach is to vary the significance level so the
ratio of significance level to the probability of some particular Type II error
approaches a limit (or at least it seems that desirable sequences of significance
points lie between $c'$ and $cn$). While the usual asymptotic theory does not give
enough information to handle this problem, the comparison of stochastic limits
may suggest a comparison of powers.

The methods of comparison discussed herein can also be used in the study of
the $\chi^2$ and likelihood ratio methods for ordinary contingency tables. We have
seen that, in a certain sense, the $\chi^2$ and likelihood ratio methods are not
equivalent when the alternate hypothesis is true and fixed, and we have suggested a
method for determining which test is to be preferred.


A STOCHASTIC MODEL FOR INDIVIDUAL
CHOICE BEHAVIOR *


R.  J.  AUDLEY 
University  College,  London 


This paper presents a stochastic model which is concerned with the interrelations of
the response variables observed in choice situations. The model is not a complete
theory, because it involves no assumptions about the relations between stimulus and
response variables. However, for given stimulus conditions, the parameters of the
stochastic process do provide a convenient summary of many aspects of behaviour in a
choice situation. Furthermore, the most elementary assumptions about the way in which
these parameters might vary with changed stimulus conditions lead to predictions which
are in qualitative agreement with experimental findings. In a sense, therefore, the
stochastic model can be regarded as a rudimentary theory of certain aspects of choice
behaviour.

Descriptors  of  Choice  Behavior 

A wide variety of experiments require the use of a situation involving a choice
between two or more alternatives. There are several variables which may be employed
in a descriptive summary of the behavior which appears in these situations. These
variables can be of two kinds. Firstly, there are descriptors of the primary response
to the situation, and, secondly, there are descriptors of the responses which the S
makes to his primary choices. Those of the first kind are most commonly used and the
three principal ones are: (a) Response time: the time taken for a definite choice to
be made, (b) Relative response frequency: the proportion of occasions on which a
particular choice response is made, (c) The number of vicarious trial and error
responses (VTEs): the number of vacillations between the various alternatives before
a definite choice occurs. In the second group, where the descriptor is usually a
verbal statement by the S, there are such variables as: (a) confidence in the
correctness of a given choice and (b) an assessment of the subjective difficulty of
the choice task. Clearly, the extent to which these various descriptors can be
employed will depend upon the specific details of an experiment. But, for many choice
situations, all three descriptors of the first kind can be employed. In most studies
with human Ss, descriptors of the second kind are also available. In fact, this paper
will be mainly concerned with the first kind of descriptor, but some suggestions will
be advanced which permit those of the second kind also to be included in a unitary
stochastic description of choice behavior.

* The writer is grateful to A. R. Jonckheere for his generous criticisms during the
preparation of the manuscript. He and G. C. Drew were also kind enough to comment
upon an earlier draft.

Particular Choice Situations Which Are Considered

It is believed that the underlying hypotheses upon which the stochastic description
is based are applicable to most choice situations. However, the derivation of a
mathematical model from these hypotheses which can be readily applied to experimental
data without additional assumptions is more conveniently achieved for a certain class
of situations. This class consists of experiments where knowledge of the outcome or
correctness of a response is not available to the S until after the choice has been
made. Thus, for example, most ordinary disjunctive reaction time studies are not
considered because the S in these experiments can match his response with a known
requirement. Nevertheless, the class of situations which can be considered is not a
trivial one. It includes among others (a) Discrimination experiments, including most
conventional psychophysical procedures in this category, (b) Studies of preference
and conflict, (c) Investigations of learning in choice situations.

The next section of the paper is mainly concerned with the events supposed to be
taking place during a single experimental trial.

The  Stochastic  Model 

The notions upon which the model is based are very simple and involve only two
assumptions:

Assumption 1. It is first assumed that, for given stimulus and organismic conditions,
there is associated with each possible choice response a single parameter. This
parameter determines the probability that in a small interval of time (t, t + Δt),
there will occur an "implicit" response of the kind with which the parameter is
associated.

No  specific  interpretation  is  given 
to  the  term  "implicit  response."  It 
may,  in  certain  circumstances,  be 
taken  to  be  equivalent  to  the  partial 
response  usually  classified  as  a  VTE. 
But  there  are  some  situations  in  which 
VTEs  are  not  observed  and  would 
seem  unlikely  to  be  present.  In  these 
cases  the  "implicit"  response  may  be 
regarded  as  a  tendency  to  make  a  given 
response,  or  might  perhaps  be  given 
some  physiological  interpretation. 

The probabilities of the various kinds of "implicit" responses
occurring are considered to be independent of one another, so that
for given conditions implicit responses of each kind appear at
random intervals, unaffected by the appearance of other implicit
responses. It follows from the first assumption that the
distribution of the intervals between successive implicit responses
of a given kind is exponential and is determined entirely by the
response parameter [e.g., see Feller, 1950, p. 220].

Assumption 2. It is assumed that a final choice response is made
when a run of K implicit responses of a given kind appears, this
run being uninterrupted by occurrences of implicit responses of
other kinds. K may either be assumed to take a particular value or
can be regarded as a further parameter, which can be estimated from
experimental data.
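
The two assumptions can be turned into a small simulation of a single trial. The sketch below is illustrative only and is not part of the original paper: the function name `simulate_trial`, the use of Python's `random` module, and the device of redrawing every response's exponential waiting time after each implicit response (permissible because exponential waiting times are memoryless) are all assumptions of this example.

```python
import random


def simulate_trial(rates, k=2, rng=None):
    """Simulate one trial of the run-of-K model (hypothetical helper).

    rates: one positive parameter per response alternative.
    Returns (index of the chosen response, latency, implicit responses).
    """
    rng = rng or random.Random()
    t = 0.0
    sequence = []
    run_kind, run_len = None, 0
    while True:
        # Assumption 1: each kind of implicit response has its own
        # exponential waiting time; the earliest one occurs next.
        waits = [rng.expovariate(rate) for rate in rates]
        kind = min(range(len(rates)), key=waits.__getitem__)
        t += waits[kind]
        sequence.append(kind)
        # Assumption 2: keep track of the current uninterrupted run.
        if kind == run_kind:
            run_len += 1
        else:
            run_kind, run_len = kind, 1
        if run_len == k:
            return kind, t, sequence


if __name__ == "__main__":
    rng = random.Random(0)
    trials = [simulate_trial((2.0, 1.0), k=2, rng=rng) for _ in range(10000)]
    share_a = sum(1 for choice, _, _ in trials if choice == 0) / len(trials)
    print(f"proportion of A choices: {share_a:.3f}")
```

Response 0 plays the role of A and response 1 the role of B. Over many simulated trials the proportion of A choices and the mean latency can be compared with the closed-form expressions derived later in the paper.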

Assumption 1 has been employed before. Mueller (1950) has used
this approach to describe the intervals between bar-presses in an
operant conditioning experiment where only one response is
involved. For the same situation, Estes (1950) and Bush &
Mosteller (1951) have used an assumption which is very similar, the
only difference being that their models


R.   J.   AUDLEY 




used a discontinuous rather than a continuous distribution of
responses in time. Christie (1952), in discussing the determination
of response probabilities in a discrimination experiment, has used
the same assumption for situations where two responses are
competing. Finally, the author of the present paper (Audley: 1957,
1958) has previously used the same notions to combine response
times and response probabilities in a stochastic description of
individual learning behavior. However, in all these examples, it
has been assumed that K = 1. Bush and Mosteller (1955), in an
analysis of response times in a runway situation, have considered a
continuous model with K > 1, but this generalization does not
appear to have been previously employed in a situation involving
choice.

There are several reasons which can be advanced for assuming that
K > 1. Firstly, when K = 1, but not if K > 1, the distributions of
response times for all alternatives can be shown to be identically
the same, and are exponential (e.g., see Audley, 1958). Neither of
these properties is in agreement with experimental findings.
Secondly, when K > 1, the sequence of "implicit" responses
occurring before a final choice is made offers a possible means of
including VTEs within the description of choice behavior. Thirdly,
classification of the various sequences of "implicit" choice
suggests an approach to descriptors of the second kind. For
example, "perfect confidence" in a choice might be identified with
sequences consisting of "implicit" responses of one kind only.

Derivation  of  the  Stochastic  Model 

No further assumptions are required in the derivation of the model,
which can be applied to situations involving any number, m, of
choices. However, in order to keep the exposition as brief as
possible, consideration in this paper will be limited to situations
involving a choice between only two alternatives, i.e., m = 2.
Furthermore, the mathematical problem is relatively simple when
K = 2, so that only this special case will be presented. Results
for the more general case have been derived and will be elaborated
elsewhere.

The two-choice situation with K = 2. The two possible responses
will be called A and B, and implicit responses of the two kinds
will be labelled a and b respectively. Let the parameters
associated with the two responses be α and β. Assumption 1 means
that p(a), the probability of an a occurring in a small time
interval (t, t + Δt), is given by:

\[ p(a) = \alpha\,\Delta t \tag{1a} \]

Similarly

\[ p(b) = \beta\,\Delta t \tag{1b} \]

The probability p(a or b), of an implicit response of either kind
but not both, occurring in the small time interval is

\[ p(a \text{ or } b) = p(a) + p(b) - 2p(a)p(b)
   = (\alpha+\beta)\,\Delta t - 2\alpha\beta\,(\Delta t)^2 \]

Hence

\[ p(a \text{ or } b) = (\alpha+\beta)\,\Delta t \tag{1c} \]

if terms of order (Δt)² are ignored. This becomes possible if a
transition is made to the continuous case when the distribution in
time of implicit responses follows that of a Poisson process (e.g.,
see Feller, 1950, p. 220). Therefore the probability, p(n, t), of
obtaining n implicit responses in the time interval (0, t) is
(e.g., again see Feller, 1950, p. 221):

\[ p(n,t) = \frac{(\alpha+\beta)^n\, t^n\, e^{-(\alpha+\beta)t}}{n!} \tag{2} \]

In particular the probability, p(0, t), of obtaining no implicit
response of either kind in time t is given by:

\[ p(0,t) = e^{-(\alpha+\beta)t} \tag{3} \]


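The probability, p_a, that the first implicit response to occur is
an a is

\[ p_a = \int_{t=0}^{\infty} p(0,t)\,\alpha\,dt
       = \frac{\alpha}{\alpha+\beta} = \text{say, } p \tag{4a} \]

Similarly, for implicit b responses

\[ p_b = \frac{\beta}{\alpha+\beta} = \text{say, } q = 1 - p \tag{4b} \]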


Since occurrences of implicit responses follow a Poisson process,
Equations 4a and 4b also give the probability that, starting at any
given moment, the next implicit response to occur will be an a or b
respectively. Therefore, ignoring for the moment questions
concerning the time intervals between successive implicit
responses, the sequence of events leading to a final choice can be
treated as a sequence of independent binomial trials, with the
probabilities, p_a and p_b, of the two types of event given by
Equations 4a and 4b.

The Probability, P_A, That the Final Choice Is an A Response

The possible sequences which terminate with the occurrence of an A
can be easily classified when K = 2, for they must all be simple
alternations between a and b, until two successive a's occur. The
early members of this class of sequences are: aa, baa, abaa, babaa,
etc. The respective probabilities of these various sequences are
clearly: p², p²q, p³q, p³q², etc. The over-all probability, P_A,
that the final choice is an A, is the sum of this infinite series
of sequence probabilities. Thus,

\[ P_A = p^2 + p^2 q + p^3 q + p^3 q^2 + \cdots \tag{5} \]

Whence, simplifying, and substituting for p and q from Equations 4a
and 4b

\[ P_A = \frac{\alpha^2[\alpha+2\beta]}
              {[\alpha+\beta][(\alpha+\beta)^2-\alpha\beta]} \tag{6a} \]

Similarly

\[ P_B = \frac{\beta^2[2\alpha+\beta]}
              {[\alpha+\beta][(\alpha+\beta)^2-\alpha\beta]} \]

Equation 6a may be written in the following form:

\[ P_A = \frac{\alpha}{\alpha+\beta}\cdot
         \frac{(\alpha+\beta)^2-\beta^2}{(\alpha+\beta)^2-\alpha\beta} \tag{6b} \]

so that when α > β, P_A > α/(α + β) and P_B < β/(α + β).

Thus the difference between the probabilities of the various
implicit responses occurring is accentuated in the expressions for
the probabilities of overt choice responses. The accentuation
increases with K and implies that there is more certainty in the
overt choices than in the underlying processes which determine
them. This is believed to be a property which many organisms
exhibit.
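
As a numerical check on the step from Equation 5 to Equation 6a, the following sketch sums the series of sequence probabilities directly and compares it with the closed form. The function names and the truncation at 200 terms are assumptions made for this illustration, not part of the original derivation.

```python
def p_final_a(alpha, beta):
    """Closed form of Equation 6a."""
    total = alpha + beta
    return alpha**2 * (alpha + 2 * beta) / (total * (total**2 - alpha * beta))


def p_final_a_series(alpha, beta, terms=200):
    """Partial sum of the series in Equation 5 (truncation is an assumption)."""
    p = alpha / (alpha + beta)
    q = 1.0 - p
    total = 0.0
    for v in range(terms):
        # A sequence with v alternations before the closing "aa" contains
        # 2 + v // 2 a's and (v + 1) // 2 b's: aa, baa, abaa, babaa, ...
        total += p ** (2 + v // 2) * q ** ((v + 1) // 2)
    return total


if __name__ == "__main__":
    for alpha, beta in [(1.0, 1.0), (2.0, 1.0), (5.0, 1.0)]:
        implicit = alpha / (alpha + beta)
        print(alpha, beta, round(implicit, 4), round(p_final_a(alpha, beta), 4))
```

Printing the implicit-response probability α/(α + β) beside P_A makes the accentuation visible: whenever α > β the second number is the larger of the two.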

Vicarious  Trial  and  Error 

If we identify alternating appearances of the "implicit" responses,
a and b, with VTEs, the moments of the distribution of VTEs can
readily be obtained. Attention here will be confined to the mean
number of VTEs preceding (a) any choice (b) a particular choice.

The  Mean  Number  of  VTEs  Preceding 
Any  Choice,  V 

There are no VTEs if the sequence of implicit responses is aa or
bb.

There is 1 VTE if the sequence is baa or abb.

There are 2 VTEs if the sequence is abaa or babb, and so on.

Dividing the sequences of implicit responses into those with an odd
number and those with an even number of VTEs, the following
probabilities are found (letting P(V = n) be the probability of
obtaining n VTEs):

\[ \begin{aligned}
P(V=0) &= p^2 + q^2 \\
P(V=2) &= p^3 q + p q^3 \\
P(V=4) &= p^4 q^2 + p^2 q^4 \\
&\text{etc.}
\end{aligned} \]

\[ \begin{aligned}
P(V=1) &= p^2 q + p q^2 \\
P(V=3) &= p^3 q^2 + p^2 q^3 \\
P(V=5) &= p^4 q^3 + p^3 q^4 \\
&\text{etc.}
\end{aligned} \]

Now

\[ V = P(V=1) + 2P(V=2) + 3P(V=3) + \cdots \]

and after some algebraic manipulation and again substituting for p
and q from Equations 4a and 4b,

\[ V = \frac{3\alpha\beta}{(\alpha+\beta)^2-\alpha\beta} \tag{7} \]

If γ = β/α, then Equation 7 may be rewritten as

\[ V = \frac{3\gamma}{(1+\gamma)^2-\gamma} \]

Thus V is dependent only on the ratio of β to α, and becomes a
maximum when γ = 1, i.e., α = β. Therefore the number of VTEs would
be a maximum when P_A = P_B = 1/2.
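
The "algebraic manipulation" leading to Equation 7 can be checked numerically. The sketch below, whose function names and series truncation are assumptions of this example, builds the distribution of the number of VTEs from the sequence probabilities and compares its mean with the closed form.

```python
def vte_pmf(v, alpha, beta):
    """Probability of exactly v VTEs before the final choice (K = 2)."""
    p = alpha / (alpha + beta)
    q = 1.0 - p
    # v alternations followed by "aa" (or, mirrored, by "bb").
    ends_in_a = p ** (2 + v // 2) * q ** ((v + 1) // 2)
    ends_in_b = q ** (2 + v // 2) * p ** ((v + 1) // 2)
    return ends_in_a + ends_in_b


def mean_vtes(alpha, beta):
    """Closed form of Equation 7."""
    return 3 * alpha * beta / ((alpha + beta) ** 2 - alpha * beta)


def mean_vtes_series(alpha, beta, terms=400):
    """Mean of the VTE distribution by direct summation."""
    return sum(v * vte_pmf(v, alpha, beta) for v in range(terms))


if __name__ == "__main__":
    for gamma in (0.25, 0.5, 1.0, 2.0, 4.0):
        print(gamma, round(mean_vtes(1.0, gamma), 4))
```

The printed values rise to 1.0 at γ = 1 and fall away symmetrically in log γ, which is the maximum described in the text.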

The Mean Number of VTEs Preceding A and B Responses, V_A and V_B

Separate consideration of the mean number of VTEs preceding an A
and B choice yields the following results

\[ V_A = \frac{2\alpha\beta}{(\alpha+\beta)^2-\alpha\beta}
       + \frac{\beta}{\alpha+2\beta} \tag{8a} \]

\[ V_B = \frac{2\alpha\beta}{(\alpha+\beta)^2-\alpha\beta}
       + \frac{\alpha}{2\alpha+\beta} \tag{8b} \]

Since β/(α + 2β) and α/(2α + β) may be rewritten as 1/(α/β + 2) and
1/(β/α + 2) respectively, it can be seen that on the average there
would be fewer VTEs preceding the response which is dominant at any
given moment, i.e., if P_A > P_B, V_A < V_B.
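
A quick consistency check on Equations 8a and 8b is that the two conditional means, weighted by the probabilities of the corresponding choices, must reproduce the overall mean of Equation 7. The helper names below are hypothetical and the sketch is only an illustration of that check.

```python
def choice_probabilities(alpha, beta):
    """Return (P_A, P_B) from Equation 6a and its counterpart for B."""
    total = alpha + beta
    denom = total * (total**2 - alpha * beta)
    return (alpha**2 * (alpha + 2 * beta) / denom,
            beta**2 * (2 * alpha + beta) / denom)


def mean_vtes_given_choice(alpha, beta):
    """Return (V_A, V_B) from Equations 8a and 8b."""
    common = 2 * alpha * beta / ((alpha + beta) ** 2 - alpha * beta)
    return (common + beta / (alpha + 2 * beta),
            common + alpha / (2 * alpha + beta))


if __name__ == "__main__":
    alpha, beta = 2.0, 1.0
    p_a, p_b = choice_probabilities(alpha, beta)
    v_a, v_b = mean_vtes_given_choice(alpha, beta)
    overall = 3 * alpha * beta / ((alpha + beta) ** 2 - alpha * beta)
    print(v_a, v_b, p_a * v_a + p_b * v_b, overall)
```

With α = 2 and β = 1 the dominant response A is preceded by fewer VTEs on average than B, and the weighted mean agrees with Equation 7.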

The  Time  Distribution  of  Final  Choice 

It is possible to determine all the moments of the time
distribution of final responses. Here, however, consideration will
be limited to the mean latency, L, of all responses and the mean
latencies for A and B responses taken separately, L_A and L_B
respectively.

The Mean Latencies for A and B Responses, L_A and L_B

Let P(a, t) be the probability that, at time t, no two consecutive
a's or b's have appeared, and that the last implicit response was
an a. Let P(a, t; n) be the probability that, at time t, no two
consecutive a's or b's have appeared, and that the last implicit
response was an a, and also that there have been exactly n implicit
responses. Thus

\[ P(a,t) = \sum_{n=1}^{\infty} P(a,t;n) \]

To determine P(a, t; n), Equation 2 and the method employed to find
P_A are combined.

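Let P(a; n) be the probability that a sequence of n events ends
with an a, no two consecutive a's or b's having occurred. Clearly,

\[ P(a;1) = \frac{\alpha}{\alpha+\beta}, \qquad
   P(a;2) = \frac{\alpha\beta}{(\alpha+\beta)^2}, \qquad
   P(a;3) = \frac{\alpha^2\beta}{(\alpha+\beta)^3}, \text{ etc.} \]

these probabilities being respectively associated with the
sequences: a, ba, aba, etc.

Now P(a, t; n) = p(n, t) · P(a; n), and Equation 2 gives p(n, t),
so that

\[ P(a,t;1) = p(1,t)\cdot P(a;1)
            = \frac{(\alpha+\beta)\,t\,e^{-(\alpha+\beta)t}}{1!}\cdot
              \frac{\alpha}{\alpha+\beta}
            = \frac{\alpha\, t\, e^{-(\alpha+\beta)t}}{1!} \]

\[ P(a,t;2) = \frac{(\alpha+\beta)^2 t^2 e^{-(\alpha+\beta)t}}{2!}\cdot
              \frac{\alpha\beta}{(\alpha+\beta)^2}
            = \frac{\alpha\beta\, t^2 e^{-(\alpha+\beta)t}}{2!} \]

Similarly

\[ P(a,t;3) = \frac{\alpha^2\beta\, t^3 e^{-(\alpha+\beta)t}}{3!} \]

etc. Hence

\[ P(a,t) = \sum_{n=1}^{\infty} P(a,t;n)
          = e^{-(\alpha+\beta)t}\left[\frac{\alpha t}{1!}
            + \frac{\alpha\beta t^2}{2!}
            + \frac{\alpha^2\beta t^3}{3!} + \cdots\right] \]

which, upon simplification, gives

\[ P(a,t) = e^{-(\alpha+\beta)t}\left[\cosh\!\left(\sqrt{\alpha\beta}\,t\right) - 1
          + \sqrt{\frac{\alpha}{\beta}}\,
            \sinh\!\left(\sqrt{\alpha\beta}\,t\right)\right] \tag{9a} \]

Similarly it may be determined that

\[ P(b,t) = e^{-(\alpha+\beta)t}\left[\cosh\!\left(\sqrt{\alpha\beta}\,t\right) - 1
          + \sqrt{\frac{\beta}{\alpha}}\,
            \sinh\!\left(\sqrt{\alpha\beta}\,t\right)\right] \tag{9b} \]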

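Now

\[ L_A = \frac{\int_{t=0}^{\infty} P(a,t)\,\alpha\,t\,dt}
              {\int_{t=0}^{\infty} P(a,t)\,\alpha\,dt}
       = \frac{2(\alpha+\beta)}{(\alpha+\beta)^2-\alpha\beta}
       + \frac{\beta}{(\alpha+\beta)(\alpha+2\beta)} \tag{10a} \]

and similarly

\[ L_B = \frac{2(\alpha+\beta)}{(\alpha+\beta)^2-\alpha\beta}
       + \frac{\alpha}{(\alpha+\beta)(2\alpha+\beta)} \tag{10b} \]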


By the same kind of argument it may be demonstrated that the mean
latency for all responses, L, is given by

\[ L = \frac{2(\alpha+\beta)^2+\alpha\beta}
            {[\alpha+\beta][(\alpha+\beta)^2-\alpha\beta]}
     = \frac{2}{\alpha+\beta}
     + \frac{3\alpha\beta}{[\alpha+\beta][(\alpha+\beta)^2-\alpha\beta]} \tag{11} \]

Returning to Equations 10a and 10b it can be seen that
β/[(α + β)(α + 2β)] and α/[(α + β)(2α + β)] may be written as
1/[(α + β)(α/β + 2)] and 1/[(α + β)(β/α + 2)] respectively. Thus
the dominant response will, on the average, have a shorter choice
time than the other, i.e., if P_A > P_B, L_A < L_B.
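
The latency formulas can be cross-checked in the same way as the VTE results: the overall mean latency of Equation 11 should equal the two conditional latencies of Equations 10a and 10b weighted by the choice probabilities. The sketch below uses hypothetical function names and arbitrary example parameters.

```python
def choice_probabilities(alpha, beta):
    """Return (P_A, P_B) from Equation 6a and its counterpart for B."""
    total = alpha + beta
    denom = total * (total**2 - alpha * beta)
    return (alpha**2 * (alpha + 2 * beta) / denom,
            beta**2 * (2 * alpha + beta) / denom)


def mean_latencies(alpha, beta):
    """Return (L_A, L_B, L) from Equations 10a, 10b and 11."""
    total = alpha + beta
    core = total**2 - alpha * beta
    l_a = 2 * total / core + beta / (total * (alpha + 2 * beta))
    l_b = 2 * total / core + alpha / (total * (2 * alpha + beta))
    l_all = 2 / total + 3 * alpha * beta / (total * core)
    return l_a, l_b, l_all


if __name__ == "__main__":
    alpha, beta = 2.0, 1.0
    p_a, p_b = choice_probabilities(alpha, beta)
    l_a, l_b, l_all = mean_latencies(alpha, beta)
    print(l_a, l_b, l_all, p_a * l_a + p_b * l_b)
```

The time unit is whatever unit the parameters α and β are expressed in; doubling both parameters halves every latency.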

In order to compare the theoretical response time distribution to
observed data, the probability P(0, t) of no final response having
occurred by time t is also given. This is clearly

\[ P(0,t) = p(0,t) + P(a,t) + P(b,t) \]

p(0, t) is given by Equation 3 and P(a, t) and P(b, t) by Equations
9a and 9b so that, upon some simplification,

\[ P(0,t) = e^{-(\alpha+\beta)t}\left[2\cosh\!\left(\sqrt{\alpha\beta}\,t\right) - 1
          + \frac{\alpha+\beta}{\sqrt{\alpha\beta}}\,
            \sinh\!\left(\sqrt{\alpha\beta}\,t\right)\right] \tag{12} \]
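
The probability that no final response has occurred by time t can also be evaluated term by term, by adding p(0, t) to the products p(n, t) · P(a; n) and p(n, t) · P(b; n) for n = 1, 2, .... The sketch below does this and compares the result with a closed form in hyperbolic functions; that closed form, the function names, and the integration settings are assumptions of this example (the closed form is what the series sums to). Integrating the survival probability over time should recover the mean latency of Equation 11.

```python
import math


def survival_series(t, alpha, beta, terms=80):
    """P(0, t) summed over alternating sequences of n implicit responses."""
    decay = math.exp(-(alpha + beta) * t)
    total = decay  # p(0, t), Equation 3
    for n in range(1, terms + 1):
        # An alternating sequence of length n ending in a holds
        # (n + 1) // 2 a's and n // 2 b's; ending in b is the mirror image.
        ends_a = alpha ** ((n + 1) // 2) * beta ** (n // 2)
        ends_b = beta ** ((n + 1) // 2) * alpha ** (n // 2)
        total += (ends_a + ends_b) * t**n * decay / math.factorial(n)
    return total


def survival_closed(t, alpha, beta):
    """Closed form obtained by summing the series (assumed form)."""
    root = math.sqrt(alpha * beta)
    return math.exp(-(alpha + beta) * t) * (
        2 * math.cosh(root * t) - 1
        + (alpha + beta) / root * math.sinh(root * t)
    )


def mean_latency_numeric(alpha, beta, t_max=40.0, steps=40000):
    """Trapezoidal integral of the survival function over (0, t_max)."""
    dt = t_max / steps
    values = [survival_closed(i * dt, alpha, beta) for i in range(steps + 1)]
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))


if __name__ == "__main__":
    alpha, beta = 2.0, 1.0
    for t in (0.0, 0.5, 1.0, 2.0):
        print(t, survival_series(t, alpha, beta), survival_closed(t, alpha, beta))
    print(mean_latency_numeric(alpha, beta))
```

Plotting either function against an observed proportion of trials still unanswered at time t is one way the comparison with data mentioned in the text could be made.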

The    Model    and    Descriptors    of   the 
Second  Kind 

At present, it is only possible to advance some speculations
concerning variables such as "degree of confidence" in the
correctness of a given choice. Nevertheless, it seems worth
considering these since there appears to be a definite relation
between the second kind of descriptor and the more conventional
indices of choice behavior. Henmon (1911), whose paper will be
considered in more detail later, showed that choices regarded by an
S with confidence are generally quicker and more accurate than
others. This result was demonstrated in a psychophysical
discrimination situation where a definite correct choice existed.

There seem to be two possible ways in which "confidence" might be
attributed to a particular choice. The first of these involves some
classification of the various sequences of implicit responses
preceding a final choice. For example, sequences which involve no
vacillation at all, such as aa, or bb, might be regarded as "more
confident" than sequences involving a large number of vacillations,
such as abababaa. It will be shown that this kind of "confident"
sequence has the properties required by Henmon's data.

For, suppose A be the correct and B the incorrect choice in a
psychophysical situation, then generally speaking one would expect
α > β. The probability of the sequence aa would be α²/(α + β)² and
the probability of bb, β²/(α + β)². Hence, the probability, P_c, of
being correct for this type of confident "choice," i.e., choosing
A, is given by

\[ P_c = \frac{\alpha^2}{\alpha^2+\beta^2} \tag{13} \]

Comparing this probability with the overall probability of an A
response, P_A, given by Equation 6a,

\[ P_c - P_A = \frac{\alpha^2}{\alpha^2+\beta^2}
             - \frac{\alpha^2(\alpha+2\beta)}
                    {[\alpha+\beta][(\alpha+\beta)^2-\alpha\beta]}
             = \frac{\alpha^2\beta^2(\alpha-\beta)}
                    {[\alpha^2+\beta^2][\alpha+\beta][(\alpha+\beta)^2-\alpha\beta]} \tag{14} \]

Clearly, Equation 14 is positive when α > β and hence P_c > P_A.
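
The algebra behind Equation 14 is easy to verify numerically. The following sketch, with hypothetical function names, evaluates Equation 13, Equation 6a, and the right-hand side of Equation 14, and confirms that the difference changes sign exactly where α and β cross.

```python
def p_confident_correct(alpha, beta):
    """Equation 13: probability that a no-vacillation choice is A."""
    return alpha**2 / (alpha**2 + beta**2)


def p_final_a(alpha, beta):
    """Equation 6a: overall probability of an A choice."""
    total = alpha + beta
    return alpha**2 * (alpha + 2 * beta) / (total * (total**2 - alpha * beta))


def confidence_gain(alpha, beta):
    """Right-hand side of Equation 14."""
    total = alpha + beta
    return (alpha**2 * beta**2 * (alpha - beta)
            / ((alpha**2 + beta**2) * total * (total**2 - alpha * beta)))


if __name__ == "__main__":
    for alpha, beta in [(2.0, 1.0), (1.0, 1.0), (1.0, 2.0)]:
        direct = p_confident_correct(alpha, beta) - p_final_a(alpha, beta)
        print(alpha, beta, direct, confidence_gain(alpha, beta))
```

For α = 2 and β = 1 the confident choices are correct with probability 0.8, against roughly 0.76 for all choices taken together.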

Since for these "confident" responses only two implicit responses
occur before a final choice, it is clear that their mean response
time is shorter than the over-all average response time. This
approach consists essentially in equating "degree of confidence"
with some function of the reciprocal of the number of VTEs
preceding the final choice.

The second suggested approach to judgmental confidence is based
upon the fact that these appraisals of a response, under normal
instructions, follow after the response itself. Degree of
confidence, therefore, might be associated with implicit responses
continuing to occur after an overt choice response has occurred.
If, after an A response has been made, a further a occurs in the
time before the statement of confidence is produced, this might be
taken to lead to greater confidence than if nothing or a b
appeared. Indeed, it might be possible to develop a model for the
distribution of the times between making the primary choice
response and giving an estimate for degree of confidence from this
kind of assumption.

Other approaches to the second kind of descriptor are undoubtedly
possible within the present scheme. The important point is that it
is possible to test these various hypotheses quite easily. They
each predict how often a given level of confidence would be
employed. Also the expected distribution of descriptors of the
first kind associated with each level of confidence can be
determined.

The Agreement between the Properties of the Model and Empirical Data

The principal aim of this paper is to show that a set of very
simple assumptions can be used to derive relations which might be
expected among the variables observed in a choice situation. In an
exposition of this kind it is not possible to examine, in any
detail, the success of the model in describing the results of
experiments which are relevant. For one thing, only the particular
case arising when K = 2 has been presented, whereas in practice it
may be more profitable to treat K as a parameter. Also, the
argument so far presented is concerned with the events supposed to
occur at a single experimental trial. The manner in which the model
is applied to experimental data based upon a number of trials will
depend very much upon the way in which separate trials resemble one
another. There may be actual variations in stimulus conditions from
trial to trial, or there may be a direct dependence of later upon
earlier trials, as in learning experiments. For this reason,
consideration of quantitative evidence will be mainly confined to
an experiment by Henmon (1911), in which the conditions under which
individual trials were conducted closely resemble one another and
where it can reasonably be assumed that there are no systematic
changes in an S's behavior. These data can therefore be regarded as
appropriate for testing the model without there being any need to
make further special assumptions. However, before examining
Henmon's results, it seems worthwhile to exhibit the manner in
which the model seems to match empirical evidence about choice
behavior in general.

In effecting a general appraisal of the model, one is hindered by
the general lack of individual results in the experimental
literature. For reasons which cannot be examined here it seems
preferable to test hypotheses about functional relations upon
individual data. A brief argument for this point of view has been
presented by Bakan (1955) and for the study of learning behavior by
Audley and Jonckheere (1956). The reader is referred to these
papers for further details. However, irrespective of the stand
taken on this question, it is clear that the present model is
concerned with individual results and that such results are not
generally available. For this reason, the following comparison of
the model with experimental evidence is largely qualitative,
although, given appropriate data, quantitative comparisons would
have been possible.

Psychophysical Discrimination Situations

In considering results from psychophysical experiments, say using
the constant method, it is necessary to consider separately the
comparison of each variable with the standard. This is so because
no assumptions have thus far been made about the relation between
stimulus and response variables. In spite of this, some general
predictions can be made.

Consider the results obtained from the comparison of the standard
with a particular variable stimulus. In this comparison, it can be
supposed that the responses A and B refer to the respective
statements "the variable is greater than the standard" and "the
variable is smaller than the standard." α will clearly be a
monotonically increasing function of the magnitude of the variable,
and β a monotonically decreasing function of the same magnitude. At
the PSE, α = β. Within limits, and certainly for a range of stimuli
close to the PSE, (α + β) can be assumed to be approximately
constant. This supposition is not crucial, but simplifies the
ensuing argument.

Relation of Judgment Time to the Perceived Distance between Stimuli

Equation 11 gives the mean choice time as a function of α and β.
This can be rewritten in the following way:

\[ L = \frac{2}{\alpha+\beta}
     + \frac{3}{[\alpha+\beta]}
       \left[\frac{(\alpha+\beta)^2}{\alpha\beta} - 1\right]^{-1} \tag{15} \]

If (α + β) is approximately constant, L will depend principally
upon the product of the parameters, αβ. Thus L will have a maximum
when α = β. From Equation 6a it can be seen that the point, α = β,
also defines the PSE, since for these parameter values
P_A = P_B = 0.5. It can be seen that decision time will therefore
rise monotonically up to the PSE and then decrease monotonically
beyond the PSE. For the range and distribution of stimuli employed
in most psychophysical studies, the decrease in decision time upon
either side of the PSE will be, according to the model,
approximately symmetrical. These properties are in agreement with
empirical data, as for example summarized by Guilford (1954).
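
The predicted peak in decision time at the PSE can be illustrated by holding α + β fixed and sweeping α across its range. The constant total of 2.0, the grid of 21 points, and the function names in the sketch below are arbitrary assumptions chosen for the illustration; nothing in the paper fixes them.

```python
def mean_latency(alpha, beta):
    """Mean choice time for all responses (Equation 11)."""
    total = alpha + beta
    return 2 / total + 3 * alpha * beta / (total * (total**2 - alpha * beta))


def latency_profile(total=2.0, points=21):
    """Latencies for alpha on an even grid with alpha + beta held at total."""
    profile = []
    for i in range(1, points + 1):
        alpha = total * i / (points + 1)
        profile.append((alpha, mean_latency(alpha, total - alpha)))
    return profile


if __name__ == "__main__":
    for alpha, latency in latency_profile():
        print(f"alpha={alpha:.3f}  beta={2.0 - alpha:.3f}  L={latency:.4f}")
```

With these settings the middle grid point has α = β = 1, and the latencies on either side of it mirror each other, as the text's symmetry argument requires when α + β is constant.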

Even where the S is allowed three categories of response, it is the
boundaries between these categories which show peak decision times
(Cartwright, 1941). This would be expected if a further parameter
be used to characterize "equal" or "doubtful" responses. It would
be of great interest to determine whether, in fact, a further
response parameter is required when a third response category is
permitted. Almost by definition, the response "doubtful" implies
that no decision has been reached by a certain time. Such responses
would then appear to be best described by the time which the S is
willing to spend in attempting to come to a decision. This would
make the range of stimuli over which judgments of "doubtful" are
made depend only indirectly upon differential sensitivity. The
readiness of the S to continue attempting to arrive at a definite
answer would also play an important role. This is in accord with
the generally accepted view of the use of a third category, e.g.,
Woodworth (1938), Guilford (1954). On the other hand, a parameter
to specify judgments of "equality" may still be required. This
would allow for a time determined "doubtful" judgment of the kind
discussed above, but would also introduce a true "equals" category.
This would enable an analysis of the third category to be carried
out in accordance with the suggestions of Cartwright (1941) and
George (1917).

The Relation between Confidence, Decision Time and Perceived
Distance between Stimuli

The exact nature of the relations between the variables considered
in this section will depend upon whether stimulus conditions are
the same for all trials. Nevertheless, some general predictions can
be advanced.

Here, "degree of confidence" will be equated with some function of
the reciprocal of the number of VTEs preceding a final choice. The
number of VTEs can, of course, range from zero to infinity.
Generally speaking, confidence is rated upon some scale from zero
to unity. Let C be the degree of confidence associated with a given
choice, and V the number of VTEs preceding this choice act.
Determining a suitable relation between C and V would, in fact, be
one of the experimental problems suggested by the present approach.
For the moment, however, it will be assumed that

\[ C = \frac{1}{V+1} \tag{16} \]

so that when V = 0, C = 1; and when V = ∞, C = 0.

It will be recalled from the section concerned with VTEs that the
number of these will, when K = 2, be two less than the number of
implicit responses preceding a final choice. Now it can easily be
demonstrated, using Equation 1c, that the mean choice time when n
implicit responses occur, T_n, is given by

\[ T_n = \frac{n}{\alpha+\beta} \tag{17} \]

Whence, since V = n - 2, n can be eliminated from Equation 17, and
it is possible to express the mean choice time, T, as a function of
V, given by

\[ T = \frac{V+2}{\alpha+\beta} \tag{18} \]

Substituting for V from Equation 16 and adding an arbitrary
constant, T_0, for the minimum choice time possible,

\[ T = \frac{1}{(\alpha+\beta)C} + \frac{1}{\alpha+\beta} + T_0 \tag{19} \]

This hyperbolic function is in agreement with experimental
determinations of the relation between confidence and judgment
time, e.g., see again Guilford (1954).
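
The substitution that leads to Equation 19 can be traced step by step in code. The sketch below assumes the provisional relation C = 1/(V + 1) adopted in the text; the function names and the example values of α, β and T_0 are hypothetical.

```python
def confidence_from_vtes(v):
    """Equation 16: provisional confidence for a choice preceded by v VTEs."""
    return 1.0 / (v + 1)


def choice_time_from_vtes(v, alpha, beta, t0=0.0):
    """Mean choice time when v VTEs (v + 2 implicit responses) occur."""
    return (v + 2) / (alpha + beta) + t0


def choice_time_from_confidence(c, alpha, beta, t0=0.0):
    """Equation 19: mean choice time as a hyperbolic function of confidence."""
    total = alpha + beta
    return 1.0 / (total * c) + 1.0 / total + t0


if __name__ == "__main__":
    alpha, beta, t0 = 2.0, 1.0, 0.2
    for v in range(6):
        c = confidence_from_vtes(v)
        print(v, round(c, 3),
              round(choice_time_from_vtes(v, alpha, beta, t0), 4),
              round(choice_time_from_confidence(c, alpha, beta, t0), 4))
```

The last two columns agree for every v, and the time falls toward 2/(α + β) + T_0 as confidence approaches unity.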

If the stimulus conditions are varied between different sets of
trials, as for example in the constant method discussed in the
previous section, general conclusions are again possible. For in
discussing Equation 7, it was shown that the mean number of VTEs
depends only upon the ratio of α to β. Again assuming that (α + β)
is approximately constant, V would be a roughly symmetrical
function of the magnitude of the variable, having a maximum at the
PSE. Thus the average degree of confidence, C, would be a roughly
U shaped function having a minimum at the PSE. Since choice time
has been shown to have a maximum at the PSE and to decrease upon
either side of this point, C and T would again vary inversely. This
agrees with experimental data (see Guilford, 1954).

Preference  and  Conflict  Situations 

In this kind of situation, a number of objects are paired and the
subject makes a choice indicating the preferred object of each
pair. For any given pair of objects, say A and B, the parameters α
and β can be taken to represent some measure of preference for A
and B. Because there are a number of objects, it is more convenient
to label the r objects presented to the subject as X_i, and to let
the parameter associated with a kind of "absolute preference" for
each be α_i (i = 1, 2, ..., r). The α and β of the equations will
now be replaced by, say, α_j and α_k, for the comparison of the
j'th and k'th objects, X_j and X_k. This, of course, is to make the
very strong assumption that the α_i's are independent of the
particular comparison in which they are involved. This assumption
could be readily tested by using the model appropriately, and is
accepted here only in order to simplify notation. The results of
the following argument would be qualitatively the same, even if
there were in fact contextual effects peculiar to each comparison.

Variation in choice time among different comparisons. The set of r
objects, on the basis of a paired comparison technique, can usually
be ranked. Let i be an individual's ranking of an object, so that
we may write

X_1 > X_2 > ... > X_i > X_{i+1} > ... > X_r,

meaning X_1 is preferred to X_2 and so on. This means that
α_1 > α_2 > ... > α_i > α_{i+1} > ... > α_r. Consider any pair of
parameters, say α_j and α_k, and let these be the α and β of the
earlier equations. Then the mean choice time is given by Equation
11, and this can now be rewritten as

\[ L_{(j,k)} = \frac{2}{\alpha_j+\alpha_k}
             + \frac{3\alpha_j\alpha_k}
                    {[\alpha_j+\alpha_k][(\alpha_j+\alpha_k)^2-\alpha_j\alpha_k]} \tag{20} \]

Clearly L_(j,k) depends upon two things: the sum of the parameters
(α_j + α_k) and, secondly, the product of the parameters, α_j α_k.
Other things being equal, the choice time will decrease as
(α_j + α_k) increases. Again, with (α_j + α_k) constant, L_(j,k)
will increase with the product, reaching a maximum when α_j = α_k.
Choice time will therefore (a) depend upon the general level of
preference for objects, being quicker for preferred objects, and
(b) be quicker the greater the difference in preference for the two
paired objects. This is in agreement with experimental findings,
e.g., for children choosing among liquids to drink, Barker (1942),
for aesthetic preferences, Dashiell (1937).
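
Both predictions about paired comparisons follow directly from Equation 20 and can be illustrated with made-up preference parameters. In the sketch below the function name `pair_latency` and the parameter values are hypothetical; they are not estimates from any experiment.

```python
from itertools import combinations


def pair_latency(alpha_j, alpha_k):
    """Mean choice time for a comparison of objects j and k (Equation 20)."""
    total = alpha_j + alpha_k
    return (2 / total
            + 3 * alpha_j * alpha_k / (total * (total**2 - alpha_j * alpha_k)))


if __name__ == "__main__":
    # Hypothetical "absolute preference" parameters, most preferred first.
    preferences = {"X1": 5.0, "X2": 4.0, "X3": 2.0, "X4": 1.0}
    for (name_j, a_j), (name_k, a_k) in combinations(preferences.items(), 2):
        print(name_j, name_k, round(pair_latency(a_j, a_k), 4))
    # (a) Same ratio, higher general level of preference: quicker choice.
    print(pair_latency(4.0, 2.0), pair_latency(2.0, 1.0))
    # (b) Same sum, larger difference in preference: quicker choice.
    print(pair_latency(5.0, 1.0), pair_latency(3.0, 3.0))
```

Because Equation 20 is homogeneous of degree minus one in the parameters, doubling both α_j and α_k exactly halves the predicted choice time, which is the first comparison printed above.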


It will be interesting to determine how far the assumption of an
absence of contextual effects can be maintained. If the assumption
turns out to be approximately true, then the parameters, α_i, would
provide a means of scaling the stimulus objects for a given
individual. In essence, such an approach would resemble that
adopted by Bradley and Terry (1952), but would have the added
advantage that the scale values would have an absolute rather than
a relative basis, so that the scale values should be unaffected by
the inclusion of new comparisons.

Number of VTEs for different comparisons. It was shown, in
discussing Equation 7, that the mean number of VTEs in a given
situation depends entirely upon the ratio of α to β. Using the
present notation this would be the ratio of α_j to α_k, for objects
X_j and X_k. The number of VTEs has a maximum when α_j = α_k, and
decreases as the values of the parameters become more disparate.
Thus the number of VTEs should depend entirely upon the differences
in preference and not upon the general level of preference for the
two paired objects. Thus for adjacent objects, X_i and X_{i+1}, the
number of VTEs before a final choice will not rise with choice time
as one proceeds from preferred to nonpreferred objects. This is
slightly complicated by differences in "preference distance"
between adjacent objects, but the prediction is again found to be
in agreement with experimental evidence, e.g., see Barker (1942).

Learning  in  choice  situations.  It  is 
in  considering  learning  behavior  that 
the  need  for  individual  results  is 
greatest  (Audley  &  Jonckheere,  1956). 
The  full  advantages  of  the  present 
approach  to  response  variables  can 
only  be  gained  by  incorporating  the 
assumption in a stochastic model for learning. The way in which
this might be contrived, when K = 1, has already been outlined and
illustrated elsewhere (Audley, 1957, 1958). On the whole, therefore,
the experimental literature does not provide results in a way which
enables the predictions of the model to be falsified, even at a
qualitative  level.  The  most  that  can 
be  done  here  is  to  show  that  the  pre- 
dictions might  well  be  good  approxi- 
mations to  the  properties  of  learning 
data. 

Given  a  particular  theory  of  learn- 
ing it  would,  of  course,  be  possible  to 
anchor  the  theory  more  closely  to  re- 
sponse variables  by  identifying  the 
parameter  of  the  choice  model  with  an 
appropriate  theoretical  construction. 

The  properties  of  the  model  and 
simple  learning  behavior.  Consider, 
for  example,  learning  in  a  simple  two- 
choice situation. Let $\alpha$ be associated with A, the correct
response, and $\beta$ with B, the incorrect response. The way in
which $\alpha$ and $\beta$ vary with reward and punishment is
naturally a matter for investigation and would certainly condition
the form of the prediction which would be made. Nevertheless, it is
not unreasonable to assume that $\alpha$ will be some monotonic
increasing function, and $\beta$ some monotonic decreasing function
of practice
and  of  punishments  and  rewards. 

Let  it  be  supposed  that  the  S  has 
at  first  a  strong  tendency  to  produce 
the incorrect choice, i.e., $\alpha$ is small relative to $\beta$.
Consider, firstly, what might be expected to happen to the over-all
latency $L$, and the latencies of A and B, $L_A$ and $L_B$
respectively. In discussing Equations 10a and 10b it was shown that
the dominant response, on the average, will have the shorter choice
time. Thus in the first place it will be expected that $L_A$ will be
greater than $L_B$ until the probability of making the correct
choice, $P_A$, reaches and exceeds 0.5, when $L_A$ will be generally
shorter than $L_B$.

All of the latencies are dependent upon two factors, the sum
$(\alpha + \beta)$ and the ratio of $\alpha$ to $\beta$. The over-all
latency, $L$, if $(\alpha + \beta)$ remains constant, will rise to a
maximum until $P_A = P_B = 0.5$ (i.e., $\alpha = \beta$) and then
fall again. Superimposed upon this rise and fall will be the
influence of $(\alpha + \beta)$, and if the levels of, say,
punishment and reward are such as to disturb the constancy of this
quantity, then there will be an accentuation or flattening of the
curve of latency as a function of practice. The monotonic decline in
response latencies observed when an S is introduced into a learning
situation for the first time does not counter this prediction. For,
then, it is to be expected that $(\alpha + \beta)$ will be initially
small and the effect of increasing $\alpha$, and, hence,
$(\alpha + \beta)$ will be reinforced by the growing difference in
magnitude between $\alpha$ and $\beta$. In original learning,
therefore, the two factors work together and produce the monotonic
decrease in latency.

The number of VTEs, from Equation 7, is seen to be a function only
of the ratio of $\alpha$ to $\beta$. Thus VTEs would be expected to
rise to a maximum until $\alpha = \beta$, i.e., $P_A = P_B = 0.5$,
and then decline.

These  predictions  are  probably  only 
applicable  to  the  very  simple  two- 
choice  situations  so  far  considered. 
For  discrimination  studies,  the  prob- 
lem is  complicated  by  the  way  in 
which  the  relevant  cues  are  being 
utilized  by  the  organism  and  there  is 
no  point  in  reviewing  the  controversy 
over  this  matter.  It  does  however 
seem  worthwhile  pointing  out  that,  in 
discrimination  behavior,  it  is  very 
probable  that  there  appears  something 
like  the  problem  of  the  use  of  the  third 
category in psychophysical procedures. That is, a distinction seems to
be  necessary  between,  on  the  one 
hand,  a  definite  act  of  choice  and,  on 
the  other  hand,  behavior  which  occurs 
simply  because  something  has  to  be 
done  in  the  situation.  This  specula- 
tive point  is  raised  because  the  size 
of  the  parameters  may  exert  an  influ- 
ence upon  behavior  in  two  ways. 
Firstly,  by  determining  the  prob- 
ability of  making  a  particular  response 
when  a  "true"  choice  is  made  and, 
secondly,  by  determining  the  prob- 
ability that  a  "true"  choice  is  made. 

Henmon's  experiment.  The  experi- 
ment conducted  by  Henmon  (1911)  is 
of particular interest, because it provides data from individual Ss, in a
situation  where  stimulus  conditions 
can  be  assumed  to  be  fairly  constant 
from  trial  to  trial.  The  observations, 
therefore,  are  important  for  any  model 
concerned  with  the  properties  of 
choice  behavior. 

Henmon required Ss, in each of 1,000 trials, to decide whether one
of two horizontal lines was longer or shorter than the other. The
lengths of the lines were always 20 mm and 20.3 mm respectively. In
addition, Ss were instructed to indicate their confidence in each
judgment.

The model is qualitatively in agreement with Henmon's data, except
in two respects. Firstly, although average choice time for wrong
responses is larger than that for correct choices, as predicted by
the model, the wrong responses are relatively quicker in each
category of confidence. The second qualitative difference appears in
examining accuracy as a function of time. There is some indication
for some Ss that although there is a general decline in accuracy
with longer choice times, again predicted by the model, there is
also a slight rise in accuracy in going from very short to
moderately short choice times. It is possible that both of these
differences may be accounted for by a suitable analysis of judgments
of confidence, about which only a few speculations have been
advanced in the present paper. The important point, it seems
to  the  author,  is  that  the  general 
stochastic  model  is  capable  of  dealing 
with  this  kind  of  issue,  rather  than 
that  it  succeeds  in  all  details  at  the 
present  time. 

Henmon  gives  the  distribution  of 
all  choice  times  for  each  individual. 
Since  this  can  also  be  derived  from 
the  model,  a  comparison  of  the  two 
distributions  should  give  further  indi- 
cations as  to  the  adequacy  of  the 
present  approach  to  choice  behavior. 
In  testing  the  goodness  of  fit  of  the 
model  in  this  matter,  it  would  be 
usual  to  estimate  the  parameters  from 
the  distribution  of  choice  times  alone. 
However,  it  was  decided  that  perhaps 
a  stronger  case  could  be  made  out  if 
the  only  time  datum  used  to  estimate 
the  parameters  was  the  mean  latency. 
Two equations are of course required if values of $\alpha$ and
$\beta$ are to be determined, and $P_A$, the probability of a correct
response, was chosen for the second. Accordingly the present
estimates are based upon Equations 6a and 11.

There  must,  of  course,  be  some 
minimum  response  time  before  which 
no  response  can  occur.  This  is  not 
easy  to  determine  from  Henmon's 
tables  of  results,  because  the  data  are 
already  grouped  in  intervals  of  200 
milliseconds.  For  this  reason,  the 
minimum  possible  time  was  estimated 
in  the  following  way.  For  various 
assumed  minimum  times,  estimates  of 
$\alpha$ and $\beta$ were determined, and the
theoretical  distribution  of  choice  times 
computed.  The  value  leading  to  the 
best  fit  was  then  adopted.  This  is  not 
entirely  a  satisfactory  procedure,  but 
with K assumed to be 2, and with no direct indication of the minimum
time, it seemed the best available in the circumstances. The results
for Henmon's (1911, Table 2, p. 194) Ss Bl and Br are considered
below.

Table 1

                     Subject Bl                          Subject Br
Time interval     Observed   Expected    Time interval     Observed   Expected
in milliseconds   frequency  frequency   in milliseconds   frequency  frequency

100-              (2)^a      -           100-299           (2)^a      -
300-              57         53          300-              350        352
500-              214        229         500-              381        398
700-              220        229         700-              170        165
900-              159        168         900-              65         57
1100-             113        111         1100-             26         19
1300-             85         83          1300-             5          6
1500-             74         48          Above 1500        1          3
1700-             32         30          Total             1000       1000
1900-             18         20
2100-             11         10
2300-             8          8
Above 2500        7          11
Total             1000       1000

^a These observations ignored in calculations.

For Bl, the minimum possible time was taken to be about 0.40 sec.
On this basis $\alpha = 3.19$ and $\beta = 1.28$, these values
referring to a time scale measured in seconds. For Br, the minimum
time was taken to be 0.34 sec., giving $\alpha = 6.68$ and
$\beta = 4.28$. A
comparison  of  the  observed  and  ex- 
pected distributions  of  response  times 
is  given  in  Table  1.  The  agreement 
between  model  and  data  seems  to  be 
reasonably  good. 

Concluding  Remarks 

On  the  whole,  there  is  a  certain 
looseness  in  the  way  in  which  many 
contemporary  theories  and  even  local 
hypotheses  are  linked  to  observed  re- 
sponse variables.  It  seems  worth- 
while, therefore,  to  try  to  determine 
whether  these  variables  might  not  be 
related to one another by relatively
simple laws which operate in most
choice  situations.  In  this  way,  not 
only  are  descriptions  of  choice  be- 
havior considerably  simplified,  but 
better  ways  of  formulating  and  testing 
theories  are  suggested.  The  model 
itself  is  naturally  also  a  theory  about 
a  certain  aspect  of  behavior,  and  as 
such  needs  to  be  tested. 

In  this  presentation  of  the  general 
stochastic  model  the  intention  is  to 
indicate  the  potentialities  of  the  ap- 
proach, rather  than  to  make  specific 
tests  of  the  case  arising  when  K  =  2. 
It  is  not  to  be  expected  that  the  two 
simple  assumptions  will  alone  account 
for  the  relations  existing  between  re- 
sponse variables  in  a  wide  diversity  of 
situations.  Each  situation  will  un- 
doubtedly have  certain  unique  condi- 
tions which  have  to  be  taken  into  ac- 
count. But  the  model  does  seem  to 
share  certain  important  properties 
with  choice  behavior  and  therefore  it 
appears  to  be  a  reasonable  initial 
working  hypothesis.  It  can  be  tested 
in  great  detail  against  data,  and  the 
parameters are of a kind which could
be identified with either psychological
or  physiological  constructs. 

Methods  of  estimating  parameters 
and  statistical  tests  of  goodness  of  fit 
will  be  discussed  elsewhere.  For  the 
present  model,  neither  of  these  pro- 
cedures involves  any  novel  problems. 
For  example,  given  the  probability  of 
occurrence  of  one  of  the  alternative 
responses  and  the  over-all  mean  re- 
sponse time,  Equations  6  and  11  may 
be easily solved to give the appropriate parameter values.

A MATHEMATICAL MODEL FOR SIMPLE LEARNING

BY ROBERT R. BUSH¹ AND FREDERICK MOSTELLER
Harvard University²


Introduction 

Mathematical  models  for  empirical 
phenomena  aid  the  development  of  a 
science  when  a  sufficient  body  of  quan- 
titative information  has  been  accumu- 
lated. This  accumulation  can  be  used 
to  point  the  direction  in  which  models 
should  be  constructed  and  to  test 
the  adequacy  of  such  models  in  their 
interim  states.  Models,  in  turn,  fre- 
quently are  useful  in  organizing  and 
interpreting  experimental  data  and  in 
suggesting new directions for experimental research. Among the branches
of  psychology,  few  are  as  rich  as  learn- 
ing in  quantity  and  variety  of  available 
data  necessary  for  model  building. 
Evidence  of  this  fact  is  provided  by 
the  numerous  attempts  to  construct 
quantitative  models  for  learning  phe- 
nomena. The  most  recent  contribu- 
tion is  that  of  Estes  (2). 

In  this  paper  we  shall  present  the 
basic  structure  of  a  new  mathematical 
model  designed  to  describe  some  simple 
learning  situations.  We  shall  focus 
attention on acquisition and extinction
in experimental arrangements using
straight  runways  and  Skinner  boxes, 
though  we  believe  the  model  is  more 
general;  we  plan  to  extend  the  model 
in  order  to  describe  multiple-choice 
problems  and  experiments  in  generali- 
zation and  discrimination  in  later 
papers.  Wherever  possible  we  shall 
discuss  the  correspondence  between 
our  model  and  the  one  being  developed 
by Estes (2), since striking parallels
do  exist  even  though  many  of  the 
basic  premises  differ.  Our  model  is 
discussed  and  developed  primarily  in 
terms  of  reinforcement  concepts  while 
Estes'  model  stems  from  an  attempt 
to  formalize  association  theory.  Both 
models,  however,  may  be  re-inter- 
preted in  terms  of  other  sets  of  con- 
cepts. This  state  of  affairs  is  a  com- 
mon feature  of  most  mathematical 
models.  An  example  is  the  particle 
and  wave  interpretations  of  modern 
atomic  theory. 

We  are  concerned  with  the  type  of 
learning  which  has  been  called  "instru- 
mental conditioning"  (5),  "operant 
behavior"  or  "type  R  conditioning" 
(10),  and  not  with  "classical  condi- 
tioning" (5),  "Pavlovian  conditioning" 
or  "type  S  conditioning"  (10).  We 
shall  follow  Sears  (9)  in  dividing  up 
the  chain  of  events  as  follows:  (1)  per- 
ception of  a  stimulus,  (2)  performance 
of  a  response  or  instrumental  act,  (3) 
occurrence of an environmental event,
and (4) execution of a goal response.
Examples  of  instrumental  responses 
are  the  traversing  of  a  runway,  press- 
ing of  a  lever,  etc.  By  environmental 
events  we  mean  the  presentation  of  a 
"reinforcing  stimulus"  (10)  such  as 
food  or  water,  but  we  wish  to  include 
in  this  category  electric  shocks  and 
other  forms  of  punishment,  removal 
of  the  animal  from  the  apparatus,  the 
sounding  of  a  buzzer,  etc.  Hence  any 
change  in  the  stimulus  situation  which 
follows  an  instrumental  response  is 
called  an  environmental  event.  A  goal 
response,  such  as  eating  food  or  drink- 
ing water,  is  not  necessarily  involved 
in  the  chain.  It  is  implied,  however, 
that  the  organism  has  a  motivation 
or  drive  which  corresponds  to  some 
goal  response.  Operationally  speak- 
ing, we  infer  a  state  of  motivation 
from  observing  a  goal  response. 

Probabilities  and  How  They  Change 

As  a  measure  of  behavior,  we  have 
chosen  the  probability,  p,  that  the 
instrumental  response  will  occur  dur- 
ing a  specified  time,  h.  This  proba- 
bility will  change  during  conditioning 
and  extinction  and  will  be  related  to 
experimental  variables  such  as  latent 
time,  rate,  and  frequency  of  choices. 
The  choice  of  the  time  interval,  h,  will 
be  discussed  later.  We  conceive  that 
the  probability,  p,  is  increased  or  de- 
creased a  small  amount  after  each 
occurrence  of  the  response  and  that 
the  determinants  of  the  amount  of 
change  in  p  are  the  environmental 
events  and  the  work  or  effort  expended 
in  making  the  response.  In  addition, 
of  course,  the  magnitude  of  the  change 
depends  upon  the  properties  of  the 
organism  and  upon  the  value  of  the 
probability  before  the  response  oc- 
curred. For  example,  if  the  proba- 
bility was  already  unity,  it  could  not 
be  increased  further. 

Our task, then, is to describe the
change in probability which occurs
after  each  performance  of  the  response 
being  studied.  We  wish  to  express 
this  change  in  terms  of  the  probability 
immediately  prior  to  the  occurrence 
of  the  response  and  so  we  explicitly 
assume  that  the  change  is  independent 
of  the  still  earlier  values  of  the  proba- 
bility. For  convenience  in  describing 
the  step-wise  change  in  probability, 
we  introduce  the  concept  of  a  mathe- 
matical operator.  The  notion  is  ele- 
mentary and  in  no  way  mysterious: 
an  operator  Q  when  applied  to  an 
operand  p  yields  a  new  quantity  Qp 
(read  Q  operating  on  p).  Ordinary 
mathematical  operations  of  addition, 
multiplication,  differentiation,  etc., 
may  be  defined  in  terms  of  operators. 
For  the  present  purpose,  we  are  inter- 
ested in  a  class  of  operators  Q  which 
when  applied  to  our  probability  p  will 
give  a  new  value  of  probability  Qp. 
As  mentioned  above,  we  are  assuming 
that  this  new  probability,  Qp,  can  be 
expressed  in  terms  of  the  old  value,  p. 
Supposing  Qp  to  be  a  well-behaved 
function,  we  can  expand  it  as  a  power 
series  in  p: 


$$Qp = a_0 + a_1 p + a_2 p^2 + \cdots \tag{1}$$


where $a_0, a_1, a_2, \ldots$ are constants independent of $p$.
In order to simplify the
mathematical  analysis  which  follows, 
we  shall  retain  only  the  first  two  terms 
in  this  expansion.  Thus,  we  are  as- 
suming that  we  can  employ  operators 
which  represent  a  linear  transforma- 
tion on  p.  If  the  change  is  small,  one 
would  expect  that  this  assumption 
would provide an adequate first approximation. Our operator $Q$ is
then completely defined as soon as we specify the constants $a_0$
and $a_1$; this is the major problem at hand. For reasons that will
soon be apparent, we choose to let $a_0 = a$ and $a_1 = 1 - a - b$.
This choice of parameters permits us to write our operator in the
form

$$Qp = p + a(1 - p) - bp. \tag{2}$$

This  is  our  basic  operator  and  equation 
(2)  will  be  used  as  the  cornerstone  for 
our  theoretical  development.  To  main- 
tain the  probability  between  0  and  1, 
the  parameters  a  and  b  must  also  lie 
between  0  and  1.  Since  a  is  positive, 
we see that the term, $a(1 - p)$, of equation (2) corresponds to an
increment in $p$ which is proportional to the maximum possible
increment, $(1 - p)$. Moreover, since $b$ is positive, the term,
$-bp$, corresponds to a decrement in $p$ which is proportional to
the maximum possible decrement, $-p$. Therefore,
we  can  associate  with  the  parameter  a 
those  factors  which  always  increase 
the  probability  and  with  the  param- 
eter b  those  factors  which  always  de- 
crease the  probability.  It  is  for  these 
reasons  that  we  rewrote  our  operator 
in  the  form  given  in  equation  (2). 
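
The step from the truncated series to equation (2) is a short piece of algebra. It is written out here as a worked check under the substitution $a_0 = a$, $a_1 = 1 - a - b$ named in the text; the two end-point evaluations are added only as an illustration of why $a$ and $b$ are confined to the unit interval.

```latex
\begin{align*}
Qp &= a_0 + a_1 p \\
   &= a + (1 - a - b)\,p \\
   &= p + a - ap - bp \\
   &= p + a(1 - p) - bp . \\[6pt]
\text{At } p = 0:&\quad Qp = a , \\
\text{at } p = 1:&\quad Qp = 1 - b .
\end{align*}
```

Since $Qp$ is linear in $p$, its values for $0 \le p \le 1$ lie between these two end points, so that $0 \le a \le 1$ and $0 \le b \le 1$ suffice to keep the new probability between 0 and 1.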

We  associate  the  event  of  presenting 
a  reward  or  other  reinforcing  stimulus 
with  the  parameter  a,  and  we  assume 
that  a  =  0  when  no  reward  is  given 
as  in  experimental  extinction.  With 
the  parameter  b,  we  associate  events 
such  as  punishment  and  the  work 
required  in  making  the  response.  (See 
the  review  by  Solomon  [11]  of  the 
influence  of  work  on  behavior.)  In 
many respects, our term, $a(1 - p)$,
corresponds  to  an  increment  in  "ex- 
citatory potential"  in  Hull's  theory 
(6)  and  our  term,  —bp,  corresponds  to 
an  increment  in  Hull's  "inhibitory 
potential." 

In  this  paper,  we  make  no  further 
attempt  to  relate  our  parameters,  a 
and  b,  to  experimental  variables  such 
as  amount  of  reward,  amount  of  work, 
strength  of  motivation,  etc.  In  com- 
paring our  theoretical  results  with 
experimental  data,  we  will  choose 
values  of  a  and  b  which  give  the  best 
fit. In other words, our model at the
present time is concerned only with
the  form  of  conditioning  and  extinc- 
tion curves,  not  with  the  precise  values 
of  parameters  for  particular  conditions 
and  particular  organisms. 

Continuous  Reinforcement  and 
Extinction 

Up  to  this  point,  we  have  discussed 
only  the  effect  of  the  occurrence  of  a 
response  upon  the  probability  of  that 
response.  Since  probability  must  be 
conserved,  i.e.,  since  in  a  time  interval 
h  an  organism  will  make  some  response 
or  no  response,  we  must  investigate 
the  effect  of  the  occurrence  of  one 
response  upon  the  probability  of  an- 
other response.  In  a  later  paper,  we 
shall  discuss  this  problem  in  detail, 
but  for  the  present  purpose  we  must 
include  the  following  assumption.  We 
conceive  that  there  are  two  general 
kinds  of  responses,  overt  and  non- 
overt.  The  overt  responses  are  sub- 
divided into  classes  A,  B,  C,  etc.  If 
an  overt  response  A  occurs  and  is 
neither  rewarded  nor  punished,  then 
the  probability  of  any  mutually  ex- 
clusive overt  response  B  is  not  changed. 
Nevertheless,  the  probability  of  that 
response  A  is  changed  after  an  occur- 
rence on  which  it  is  neither  rewarded 
nor  punished.  Since  the  total  proba- 
bility of  all  responses  must  be  unity, 
it  follows  that  the  probability  gained 
or  lost  by  response  A  must  be  compen- 
sated by  a  corresponding  loss  or  gain 
in  probability  of  the  non-overt  re- 
sponses. This  assumption  is  impor- 
tant in  the  analysis  of  experiments 
which  use  a  runway  or  Skinner  box, 
for  example.  In  such  experiments  a 
single  class  of  responses  is  singled  out 
for  study,  but  other  overt  responses 
can  and  do  occur.  We  defer  until  a 
later  paper  the  discussion  of  experi- 
ments in  which  two  or  more  responses 
are  reinforced  differentially. 

With the aid of our mathematical operator of equation (2) we may
now describe the progressive change in the probability of a response
in an experiment such as the Graham-Gagné runway (3) or Skinner box
(10) in which
the  same  environmental  events  follow 
each  occurrence  of  the  response.  We 
need  only  apply  our  operator  Q  re- 
peatedly to  some  initial  value  of  the 
probability  p.  Each  application  of 
the  operator  corresponds  to  one  occur- 
rence of  the  response  and  the  sub- 
sequent environmental  events.  The 
algebra  involved  in  these  manipula- 
tions is  straightforward.  For  example, 
if  we  apply  Q  to  p  twice,  we  have 

$$Q^2 p = Q(Qp) = a + (1 - a - b)\,Qp = a + (1 - a - b)\,[a + (1 - a - b)\,p]. \tag{3}$$

Moreover,  it  may  be  readily  shown 
that  if  we  apply  Q  to  p  successively  n 
times,  we  have 


$$Q^n p = \frac{a}{a + b} - \left(\frac{a}{a + b} - p\right)(1 - a - b)^n. \tag{4}$$


Provided $a$ and $b$ are not both zero or both unity, the quantity
$(1 - a - b)^n$ tends to an asymptotic value of zero as $n$
increases. Therefore, $Q^n p$ approaches a limiting value of
$a/(a + b)$ as $n$ becomes large. Equation (4) then describes a
curve of acquisition.
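
The agreement between repeated application of $Q$ and the closed form of equation (4) may be checked numerically. The following Python sketch is offered as an illustration only; the function names and the parameter values (p0 = 0.05, a = 0.20, b = 0.05) are assumptions chosen for the example and do not come from any data set.

```python
def apply_q(p, a, b):
    """One application of the operator Q of equation (2)."""
    return p + a * (1.0 - p) - b * p


def iterate_q(p, a, b, n):
    """Apply Q to p successively n times."""
    for _ in range(n):
        p = apply_q(p, a, b)
    return p


def q_n_closed_form(p, a, b, n):
    """Closed form for Q^n p, read from equation (4)."""
    limit = a / (a + b)  # asymptotic probability a/(a + b)
    return limit - (limit - p) * (1.0 - a - b) ** n


if __name__ == "__main__":
    p0, a, b = 0.05, 0.20, 0.05  # illustrative values only
    for n in (0, 1, 5, 20, 100):
        stepwise = iterate_q(p0, a, b, n)
        closed = q_n_closed_form(p0, a, b, n)
        print(n, round(stepwise, 6), round(closed, 6))
    print("asymptote a/(a+b):", a / (a + b))
```

The two columns printed for each n should agree to rounding error, and both should approach a/(a + b) as n grows.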

It  should  be  noticed  that  the  asymp- 
totic value  of  the  probability  is  not 
necessarily  either  zero  or  unity.  For 
example, if $a = b$ (speaking roughly, this implies that the
measures of reward and work are equal), the ultimate
probability  of  occurrence  in  time  h  of 
the  response  being  studied  is  0.5. 

Since  we  have  assumed  that  a  =  0 
when  no  reward  is  given  after  the 
response  occurs,  we  may  describe  an 
extinction  trial  by  a  special  operator 
$E$ which is equivalent to our operator $Q$ of equation (2) with $a$
set equal to zero:

$$Ep = p - bp = (1 - b)p. \tag{5}$$

It  follows  directly  that  if  we  apply 
this  operator  E  to  p  successively  for  n 
times  we  have 


$$E^n p = (1 - b)^n p. \tag{6}$$


This  equation  then  describes  a  curve 
of  experimental  extinction. 
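
The extinction curve can be sketched in the same way. The Python fragment below is a minimal illustration, assuming the reading of equations (5) and (6) given above; the helper `trials_to_fall_below` and the numerical values are hypothetical and are included only to show how quickly the probability decays for a given $b$.

```python
def apply_e(p, b):
    """One extinction trial: the operator E of equation (5)."""
    return (1.0 - b) * p


def e_n_closed_form(p, b, n):
    """Closed form for E^n p, read from equation (6)."""
    return (1.0 - b) ** n * p


def trials_to_fall_below(p, b, threshold):
    """Count extinction trials until p drops below threshold.

    Hypothetical helper; requires 0 < b <= 1 and threshold > 0
    so that the loop terminates.
    """
    if not 0.0 < b <= 1.0 or threshold <= 0.0:
        raise ValueError("need 0 < b <= 1 and threshold > 0")
    n = 0
    while p >= threshold:
        p = apply_e(p, b)
        n += 1
    return n


if __name__ == "__main__":
    p0, b = 0.8, 0.1  # illustrative values only
    for n in (0, 1, 5, 10, 20):
        print(n, round(e_n_closed_form(p0, b, n), 6))
    print("trials until p < 0.05:", trials_to_fall_below(p0, b, 0.05))
```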

Probability,  Latent  Time,  and  Rate 

Before  the  above  results  on  continu- 
ous reinforcement  and  extinction  can 
be  compared  with  empirical  results, 
we  must  first  establish  relationships 
between  our  probability,  p,  and  ex- 
perimental measures  such  as  latent 
time  and  rate  of  responding.  In  order 
to  do  this,  we  must  have  a  model. 
A  simple  and  useful  model  is  the  one 
described  by  Estes  (2).  Let  the  ac- 
tivity of  an  organism  be  described  by  a 
sequence  of  responses  which  are  inde- 
pendent of  one  another.  (For  this 
purpose,  we  consider  doing  "nothing" 
to  be  a  response.)  The  probability 
that  the  response  or  class  of  responses 
being  studied  will  occur  first  is  p. 
Since  we  have  already  assumed  that 
non-reinforced  occurrences  of  other 
responses  do  not  affect  p,  one  may 
easily  calculate  the  mean  number  of 
responses  which  will  occur  before  the 
response  being  studied  takes  place. 
Estes  (2)  has  presented  this  calcula- 
tion and  shown  that  the  mean  number 
of responses which will occur, including the one being studied, is
simply $1/p$.
In  that  derivation  it  was  assumed  that 
the  responses  were  all  independent  of 
one  another,  i.e.,  that  transition  prob- 
abilities between  pairs  of  responses  are 
the  same  for  all  pairs.  This  assump- 
tion is  a  bold  one  indeed  (it  is  easy  to 
think  of  overt  responses  that  cannot 
follow  one  another),  but  it  appears  to 
us that any other assumption would
require a detailed specification of the
many  possible  responses  in  each  ex- 
perimental arrangement  being  consid- 
ered. (Miller  and  Frick  [8]  have 
attempted  such  an  analysis  for  a  par- 
ticular experiment.)  It  is  further 
assumed  that  every  response  requires 
the  same  amount  of  time,  h,  for  its 
performance.  The  mean  latent  time, 
then,  is  simply  h  times  the  mean  num- 
ber of  responses  which  occur  on  a 
"trial": 


$$L = \frac{h}{p}. \tag{7}$$


The  time,  h,  required  for  each  response 
will  depend,  of  course,  on  the  organism 
involved  and  very  likely  upon  its 
strength  of  drive  or  motivation. 
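
The mean of $1/p$ responses per trial is the mean of a geometric distribution, and the point can be verified by direct summation. The Python sketch below is an illustration under the independence assumption stated above; the function names and the truncation at 10,000 terms are assumptions of the example.

```python
def mean_latent_time(p, h):
    """Mean latent time L = h / p, as in equation (7)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return h / p


def mean_responses_by_series(p, terms=10_000):
    """Partial sum of k * p * (1 - p)**(k - 1) over k = 1..terms.

    The k-th term is the chance that the response being studied first
    occurs as the k-th response, weighted by k; the full sum is 1/p.
    """
    return sum(k * p * (1.0 - p) ** (k - 1) for k in range(1, terms + 1))


if __name__ == "__main__":
    p, h = 0.25, 2.0  # illustrative values only
    print("series estimate of mean responses:", mean_responses_by_series(p))
    print("1/p:", 1.0 / p)
    print("mean latent time h/p:", mean_latent_time(p, h))
```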

The  mean  latent  time,  L,  is  ex- 
pressed in  terms  of  the  probability,  p, 
by  equation  (7),  while  this  probability 
is  given  in  terms  of  the  number  of 
trials,  n,  by  equation  (4).  Hence  we 
may  obtain  an  expression  for  the  mean 
latent  time  as  a  function  of  the  num- 
ber of  trials.  It  turns  out  that  this 
expression  is  identical  to  equation  (4) 
of  Estes'  paper  (2)  except  for  differ- 
ences in  notation.  (Estes  uses  T  in 
place  of  our  n;  our  use  of  a  difference 
equation  rather  than  of  a  differential 
equation gives us the term $(1 - a - b)$ instead of Estes'
$e^{-q}$.) Estes fitted his equation to the data of Graham and
Gagné (3). Our results differ
from  Estes'  in  one  respect,  however: 
the  asymptotic  mean  latent  time  in 
Estes'  model  is  simply  h,  while  we 
obtain 


$$L_\infty = h\left(\frac{a + b}{a}\right). \tag{8}$$


This  equation  suggests  that  the  final 
mean  latent  time  depends  on  the 
amount  of  reward  and  on  the  amount 
of  required  work,  since  we  have  as- 
sumed that  a  and  b  depend  on  those 
two variables, respectively. This conclusion seems to agree with
the data of Grindley (4) on chicks and the data
of  Crespi  (1)  on  white  rats. 
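
Combining equations (4) and (7) gives the mean latent time on each trial, and its limit is $h(a + b)/a$. The following Python sketch traces that curve; it is an illustration only, and the function names and parameter values are assumptions of the example rather than fitted quantities.

```python
def probability_after_n(p0, a, b, n):
    """Probability after n reinforced trials, from equation (4)."""
    limit = a / (a + b)
    return limit - (limit - p0) * (1.0 - a - b) ** n


def latency_after_n(p0, a, b, h, n):
    """Mean latent time on trial n: equation (7) applied to equation (4)."""
    return h / probability_after_n(p0, a, b, n)


def asymptotic_latency(a, b, h):
    """Limiting mean latent time h * (a + b) / a."""
    return h * (a + b) / a


if __name__ == "__main__":
    p0, a, b, h = 0.05, 0.20, 0.05, 1.0  # illustrative values only
    for n in (0, 1, 2, 5, 10, 30):
        print(n, round(latency_after_n(p0, a, b, h, n), 4))
    print("asymptote:", asymptotic_latency(a, b, h))
```

With these values the latency falls from about 20h on the first trial toward 1.25h, and not toward h, which is the difference from Estes' result noted above.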

Since  equation  (7)  is  an  expression 
for  the  mean  time  between  the  end  of 
one  response  of  the  type  being  studied 
and  the  end  of  the  next  response  of  the 
type  being  studied,  we  may  now  cal- 
culate the  mean  rate  of  responding  in 
a Skinner-box arrangement. If $t$ represents the mean time required
for the occurrence of $n$ responses, measured from some arbitrary
starting point, then each occurrence of the response being studied
adds an increment in $t$ as follows:

$$\frac{\Delta t}{\Delta n} = \frac{h}{p}. \tag{9}$$

If  the  increments  are  sufficiently  small, 
we  may  write  them  as  differentials  and 
obtain  for  the  mean  rate  of  responding 


$$\frac{dn}{dt} = \omega p, \tag{10}$$

where $\omega = 1/h$. We shall call $\omega$ the "activity level"
and by definition $\omega$ is the maximum rate of responding which
occurs when $p = 1$ obtains.

The  Free-Responding  Situation 

In  free-responding  situations,  such 
as  that  in  Skinner  box  experiments,  one 
usually  measures  rate  of  responding  or 
the  cumulative  number  of  responses 
versus  time.  To  obtain  theoretical 
expressions  for  these  relations,  we  first 
obtain  an  expression  for  the  proba- 
bility />  as  a  function  of  time.  From 
equation  (2),  we  see  that  if  the  re- 
sponse being  studied  occurs,  the  change 
in  probability  is  A^  =  a{\  —  p)  —  bp. 
We  have  already  assumed  that  if  other 
responses  occur  and  are  not  reinforced, 
no  change  in  the  probability  of  occur- 
rence of  the  response  being  studied  will 
ensue.  Hence  the  expected  change  in 
probability  during  a  time  interval  h 
is  merely  the  change  in  probability 
times  the  probability  p  that  the  re- 
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sponse  being  studied  occurs  in  that      time  /.     The  result  is 
time  interval : 

n  =  3—; —  \uit-\-  -  log  [j>o(l  +  u) 


Expected  (Ap) 

=  p{a{\-p)-hp].     (11) 

The  expected  rate  of  change  of  proba- 
bility with  time  is  then  this  expression 
divided  by  the  time  h.  Writing  this 
rate  as  a  derivative  we  have 


    \frac{dp}{dt} = \omega p[a(1 - p) - bp],     (12)

where, as already defined, ω = 1/h is
the activity level. This equation is
easily integrated to give p as an explicit
function of time t. Since equation
(10) states that the mean rate of
responding, dn/dt, is ω times the probability
p, we obtain after the integration


    V = \frac{dn}{dt} = \frac{\omega p_0}{p_0(1 + u) + [1 - p_0(1 + u)] e^{-a\omega t}},     (13)

where we have let u = b/a. The
initial rate of responding at t = 0 is
V_0 = ωp_0, and the final rate after a
very long time t is

    V_\infty = \left(\frac{dn}{dt}\right)_{t=\infty} = \frac{\omega}{1 + u} = \frac{\omega}{1 + b/a}.     (14)


Equation  (13)  is  quite  similar  to  the 
expression  obtained  by  Estes  except 
for  our  inclusion  of  the  ratio  u  =  b/a. 
The final rate of responding, according
to equation (14), increases with a and
hence  with  the  amount  of  reward  given 
per  response,  and  decreases  with  b  and 
hence  with  the  amount  of  work  per 
response.  These  conclusions  do  not 
follow  from  Estes'  results  (2). 
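
The integration that leads from equation (12) to equation (13) is not written out in the text. One way to carry it out is sketched below; the auxiliary variable q and its initial value q_0 = (1 + u)p_0 are introduced here only for convenience and are not part of the original notation.

```latex
% Sketch: integrating equation (12) with u = b/a and q = (1+u)p.
\begin{align*}
\frac{dp}{dt} &= \omega p\,[a(1-p) - bp] = a\omega\, p\,[1 - (1+u)p], \\
\frac{dq}{dt} &= a\omega\, q(1-q), \qquad q = (1+u)p, \\
\frac{q}{1-q} &= \frac{q_0}{1-q_0}\, e^{a\omega t}
  \quad\Longrightarrow\quad
  q = \frac{q_0}{q_0 + (1-q_0)\,e^{-a\omega t}}, \\
\frac{dn}{dt} = \omega p
  &= \frac{\omega p_0}{p_0(1+u) + [1 - p_0(1+u)]\,e^{-a\omega t}} .
\end{align*}
```

Letting t grow without bound in the last line removes the exponential and leaves ω/(1 + u), which is the final rate of equation (14).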

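An expression for the cumulative
number of responses during continuous
reinforcement is obtained by integrating
equation (13) with respect to
time t. The result is

    n = \frac{1}{1 + u}\left\{\omega t + \frac{1}{a}\log\left[p_0(1 + u)(1 - e^{-a\omega t}) + e^{-a\omega t}\right]\right\}.     (15)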


As the time t becomes very large, the
exponentials  in  equation  (15)  approach 
zero  and  n  becomes  a  linear  function 
of  time.  This  agrees  with  equation 
(14)  which  says  that  the  asymptotic 
rate  is  a  constant.  Both  equations 
(13)  and  (15)  for  rate  of  responding 
and  cumulative  number  of  responses, 
respectively,  have  the  same  form  as 
the  analogous  equations  derived  by 
Estes  (2)  which  were  fitted  by  him  to 
data  on  a  bar-pressing  habit  of  rats. 
The  essential  difference  between  Estes' 
results  and  ours  is  the  dependence, 
discussed  above,  of  the  final  rate  upon 
amount  of  work  and  amount  of  reward 
per  trial. 
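
As a rough numerical illustration of equations (13), (14) and (15), the sketch below evaluates the rate and the cumulative number of responses during continuous reinforcement. The function names and the parameter values (p0 = 0.05, a = 0.2, b = 0.05, omega = 2.0) are hypothetical choices made for illustration only; they are not estimates from any data discussed in the text.

```python
import math


def rate(t, p0, a, b, omega):
    """Mean rate of responding dn/dt, equation (13)."""
    u = b / a
    q0 = p0 * (1.0 + u)
    return omega * p0 / (q0 + (1.0 - q0) * math.exp(-a * omega * t))


def cumulative(t, p0, a, b, omega):
    """Cumulative number of responses n(t), equation (15), with n(0) = 0."""
    u = b / a
    q0 = p0 * (1.0 + u)
    decay = math.exp(-a * omega * t)
    return (omega * t + math.log(q0 * (1.0 - decay) + decay) / a) / (1.0 + u)


def final_rate(a, b, omega):
    """Asymptotic rate of responding, equation (14)."""
    return omega / (1.0 + b / a)


# Hypothetical parameter values, chosen only to make the curves visible.
P0, A, B, OMEGA = 0.05, 0.2, 0.05, 2.0

for t in (0.0, 5.0, 20.0, 100.0):
    print(t, rate(t, P0, A, B, OMEGA), cumulative(t, P0, A, B, OMEGA))
print("final rate:", final_rate(A, B, OMEGA))
```

Raising a or lowering b in this sketch raises the final rate, which is the dependence on reward and work described above.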

We  may  extend  our  analysis  to  give 
expressions  for  rates  and  cumulative 
responses  during  extinction.  Since  we 
have  assumed  that  a  =  0  during  ex- 
tinction, we  have  in  place  of  equa- 
tion (12) 


    \frac{dp}{dt} = -\omega b p^2,     (16)

which, when integrated for p and multiplied
by ω, gives

    \frac{dm}{dt} = \frac{\omega p_e}{1 + \omega b p_e t},     (17)


where p_e is the probability at the beginning
of extinction. The rate at the
beginning of extinction is V_e = ωp_e.
Hence we may write equation (17) in
the form

    V = \frac{dm}{dt} = \frac{V_e}{1 + V_e b t}.     (18)

An integration of this equation gives
for the cumulative number of extinction
responses

    m = \frac{1}{b}\log[1 + V_e b t] = \frac{1}{b}\log\left(\frac{V_e}{V}\right).     (19)


This  result  is  similar  to  the  empirical 
equation  m  =  K  log  t,  used  by  Skinner 
in  fitting  experimental  response  curves 
(10).  Our  equation  has  the  additional 
advantage  of  passing  through  the  ori- 
gin as  it  must. 
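
The extinction curves of equations (18) and (19) can be tabulated with a few lines of code. This is a minimal sketch; the function names and the values v_e = 1.5 and b = 0.03 are hypothetical and serve only to show that the curve starts at the origin and keeps growing logarithmically.

```python
import math


def extinction_rate(t, v_e, b):
    """Mean rate of responding during extinction, equation (18)."""
    return v_e / (1.0 + v_e * b * t)


def extinction_responses(t, v_e, b):
    """Cumulative number of extinction responses, equation (19)."""
    return math.log(1.0 + v_e * b * t) / b


# Hypothetical starting rate and work parameter.
V_E, B = 1.5, 0.03

for t in (0.0, 10.0, 100.0, 1000.0):
    print(t, extinction_rate(t, V_E, B), extinction_responses(t, V_E, B))
```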

It  may  be  noted  that  the  logarithmic 
character  of  equation  (19)  implies  that 
the  total  number  of  extinction  re- 
sponses, m,  has  no  upper  limit.  Thus, 
if  our  result  is  correct,  and  indeed  if 
Skinner's  empirical  equation  is  correct, 
then  there  is  no  upper  limit  to  the 
size  of  the  "reserve"  of  extinction  re- 
sponses. For  all  practical  purposes, 
however,  the  logarithmic  variation  is 
so  slow  for  large  values  of  the  time  t, 
it  is  justified  to  use  some  arbitrary 
criterion  for  the  "completion"  of  ex- 
tinction. We  shall  consider  extinction 
to  be  "complete"  when  the  mean  rate 
of  responding  V  has  fallen  to  some 
specified value, V_f. Thus, the "total"
number of extinction responses from
this criterion is

    m_T = \frac{1}{b}\log\frac{V_e}{V_f}.     (20)


We  now  wish  to  express  this  "total" 
number of extinction responses, m_T,
as  an  explicit  function  of  the  number 
of  preceding  reinforcements,  n.  The 
only  quantity  in  equation  (20)  which 
depends  upon  n  is  the  rate,  Ve,  at  the 
beginning  of  extinction.  If  we  assume 
that  this  rate  is  equal  to  the  rate  at 
the  end  of  acquisition,  we  have  from 
equations  (4)  and  (10) 

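    V_e = \frac{dn}{dt} = V_{\max} - (V_{\max} - V_0)(1 - a - b)^n,     (21)

where we have let

    V_{\max} = \frac{\omega a}{a + b},     (22)

and where V_0 = ωp_0 is the rate at the
beginning of acquisition. If we now
substitute equation (21) into equation
(20), we obtain

    m_T = \frac{1}{b}\log\left[\frac{V_{\max}}{V_f} - \left(\frac{V_{\max}}{V_f} - \frac{V_0}{V_f}\right)(1 - a - b)^n\right].     (23)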


This  result  may  be  compared  with  the 
data  of  Williams  (12)  obtained  by 
measuring  the  "total"  number  of  ex- 
tinction responses  after  5,  10,  30  and 
90  reinforcements.  From  the  data, 
the ratio V_max/V_f was estimated to
be about 5, and the ratio V_0/V_f was
assumed to be about unity. Values
of a = 0.014 and b = 0.026 were chosen
in  fitting  equation  (23)  to  the  data. 
The  result  is  shown  in  the  figure. 
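
Equation (23) can be evaluated directly with the parameter values quoted above (a = 0.014, b = 0.026, V_max/V_f = 5, V_0/V_f = 1). The sketch below does only that; the function name is hypothetical, and the printed numbers are model values, not the Williams data.

```python
import math


def total_extinction_responses(n, a, b, vmax_over_vf, v0_over_vf):
    """'Total' number of extinction responses after n reinforcements, equation (23)."""
    ratio = vmax_over_vf - (vmax_over_vf - v0_over_vf) * (1.0 - a - b) ** n
    return math.log(ratio) / b


# Parameter values quoted in the text for the fit to equation (23).
A, B = 0.014, 0.026
VMAX_OVER_VF, V0_OVER_VF = 5.0, 1.0

for n in (5, 10, 30, 90):
    print(n, total_extinction_responses(n, A, B, VMAX_OVER_VF, V0_OVER_VF))

# Upper bound approached as n grows: (1/b) log(V_max / V_f).
print("limit:", math.log(VMAX_OVER_VF) / B)
```

With V_0/V_f = 1 the curve starts at zero for n = 0 and rises toward the limit (1/b) log(V_max/V_f) as the number of reinforcements grows.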

Fixed  Ratio  and  Random  Ratio 
Reinforcement 

In  present  day  psychological  lan- 
guage, the  term  "fixed  ratio"  (7)  refers 
to  the  procedure  of  rewarding  every 
kth response in a free-responding situation
(k = 2, 3, ...). In a "random
ratio"  schedule,  an  animal  is  rewarded 
on  the  average  after  k  responses  but  the 
actual  number  of  responses  per  reward 
varies  over  some  specified  range.  We 
shall  now  derive  expressions  for  mean 
rates  of  responding  and  cumulative 
numbers  of  responses  for  these  two 
types  of  reinforcement  schedules.  If 
we apply our operator Q, of equation
(2), to a probability p, and then apply
our operator E, of equation (5), to Qp
repeatedly for (k - 1) times, we obtain

    (E^{k-1}Q)p = (1 - b)^{k-1}[p + a(1 - p) - bp]
                = p + a'(1 - p) - b'p,     (24)

where

    a' = a(1 - b)^{k-1} = a[1 - (k - 1)b + \cdots] \approx a     (25)

and

    b' = 1 - (1 - b)^k \approx kb.     (26)

Figure. "Total" number of extinction responses as a function of the number of reinforcements.
Curve plotted from equation (23) with b = 0.026, a = 0.014, V_max = 5V_0, V_f = V_0.
Data from Williams (12).

The symbol ≈ means "approximately
equal to." In the present case the
exact approach would be to retain the
primes on a and b throughout; however,
the approximations provide a link
with the previous discussion. The
approximations on the right of these
two equations are justified if kb is
small compared to unity. Now the
mean change in p per response will be
the second and third terms of equation
(24) divided by k:

    \Delta p = \frac{a'}{k}(1 - p) - \frac{b'}{k}p \approx \frac{a}{k}(1 - p) - bp.     (27)


This  equation  is  identical  to  our  result 
for  continuous  reinforcement,  except 
that  a'/k  replaces  a  and  b'/k  replaces  b. 
We may obtain a similar result for
the "random ratio" schedule as follows:
After any response, the probability
that Q operates on p is 1/k and
the probability that E operates on
p is (1 - 1/k). Hence the expected
change in p per response is

    \text{Expected}(\Delta p) = \frac{1}{k}Qp + \left(1 - \frac{1}{k}\right)Ep - p.     (28)

After equations (2) and (5) are inserted
and the result simplified, we obtain
from equation (28)

    \text{Expected}(\Delta p) = \frac{a}{k}(1 - p) - bp.     (29)


This  result  is  identical  to  the  approxi- 
mate result  shown  in  equation  (27)  for 
the fixed ratio case. Since both equations
(27) and (29) have the same
form as our result for the continuous
reinforcement  case,  we  may  at  once 
write  for  the  mean  rate  of  responding 
an  equation  identical  to  equation  (13), 
except  that  a  is  replaced  by  a'/k. 
Similarly,  we  obtain  an  expression  for 
the  final  rate  of  responding  identical 
to  equation  (14)  except  that  a  is  re- 
placed by  a'/k.  This  result  is  meant 
to  apply  to  both  fixed  ratio  and  ran- 
dom ratio  schedules  of  reinforcement. 
In  comparing  the  above  result  for 
the  asymptotic  rates  with  equation 
(14)  for  continuous  reinforcement,  we 
must  be  careful  about  equating  the 
activity level, ω, for the three cases
(continuous, fixed ratio and random
ratio reinforcements). Since 1/ω represents
the minimum mean time between
successive responses, it includes
both  the  eating  time  and  a  "recovery 
time."  By  the  latter  we  mean  the 
time  necessary  for  the  animal  to  re- 
organize itself  after  eating  and  get  in 
a  position  to  make  another  bar  press 
or  key  peck.  In  the  fixed  ratio  case, 
presumably  the  animal  learns  to  look 
for  food  not  after  each  press  or  peck, 
as  in  the  continuous  case,  but  ideally 
only after every kth response. Therefore
both the mean eating time and
the  mean  recovery  time  per  response 
are  less  for  the  fixed  ratio  case  than 
for  the  continuous  case.  In  the  ran- 
dom ratio  case,  one  would  expect  a 
similar  but  smaller  difference  to  occur. 
Hence,  it  seems  reasonable  to  conclude 
that the activity level, ω, would be
smaller for continuous reinforcement
than for either fixed ratio or random
ratio, and that ω would be lower for
random ratio than for fixed ratio when
the mean number of responses per
reward was the same. Moreover, we
should expect that ω would increase
with  the  number  of  responses  per  re- 
ward, k.  Even  if  eating  time  were 
subtracted  out  in  all  cases  we  should 
expect  these  arguments  to  apply. 
Without  a  quantitative  estimate  of 


the  mean  recovery  time,  we  see  no 
meaningful  way  of  comparing  rates  of 
responding  under  continuous  reinforce- 
ment with  those  under  fixed  ratio  and 
random  ratio,  nor  of  comparing  rates 
under  different  ratios  (unless  both 
ratios  are  large).  The  difficulty  of 
comparing  rates  under  various  rein- 
forcement schedules  does  not  seem  to 
be  a  weakness  of  our  model,  but  rather 
a  natural  consequence  of  the  experi- 
mental procedure.  However,  the  im- 
portance of  these  considerations  hinges 
upon  the  orders  of  magnitude  involved, 
and  such  questions  are  empirical  ones. 
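
The operator algebra behind the fixed ratio and random ratio results, equations (24) through (29), can be checked numerically. The sketch below assumes the operator forms Qp = p + a(1 - p) - bp and Ep = p - bp, as implied by the expressions quoted in this section; the function names and the values a = 0.1, b = 0.02, k = 5, p = 0.3 are hypothetical.

```python
def Q(p, a, b):
    """Operator for a rewarded response (assumed form of equation (2))."""
    return p + a * (1.0 - p) - b * p


def E(p, b):
    """Operator for an unrewarded response (assumed form of equation (5))."""
    return p - b * p


def fixed_ratio_block(p, a, b, k):
    """Apply Q once and then E (k - 1) times, the left side of equation (24)."""
    p = Q(p, a, b)
    for _ in range(k - 1):
        p = E(p, b)
    return p


def primed_parameters(a, b, k):
    """Exact a' and b' of equations (25) and (26)."""
    return a * (1.0 - b) ** (k - 1), 1.0 - (1.0 - b) ** k


def random_ratio_expected_change(p, a, b, k):
    """Expected change in p per response, equation (28)."""
    return Q(p, a, b) / k + (1.0 - 1.0 / k) * E(p, b) - p


# Hypothetical values with k * b small compared to unity.
a, b, k, p = 0.1, 0.02, 5, 0.3
a_prime, b_prime = primed_parameters(a, b, k)

print(fixed_ratio_block(p, a, b, k), p + a_prime * (1.0 - p) - b_prime * p)
print(random_ratio_expected_change(p, a, b, k), (a / k) * (1.0 - p) - b * p)
print("a' vs a:", a_prime, a, " b' vs k*b:", b_prime, k * b)
```

The first two printed pairs should agree to rounding error, while the last line shows how close the primed parameters are to their approximations for this choice of k and b.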

Aperiodic  and  Periodic  Reinforcement 

Many  experiments  of  recent  years 
were  designed  so  that  an  animal  was 
reinforced  at  a  rate  aperiodic  or  peri- 
odic in  time  (7).  The  usual  procedure 
is  to  choose  a  set  of  time  intervals, 
T_1, ..., T_n, which have a mean value
T.  Some  arrangement  of  this  set  is 
used  as  the  actual  sequence  of  time 
intervals  between  rewards.  The  first 
response  which  occurs  after  one  of 
these  time  intervals  has  elapsed  is 
rewarded. 

To  analyze  this  situation  we  may 
consider  k,  the  mean  number  of  re- 
sponses per  reward,  to  be  equal  to  the 
mean  time  interval  T  multiplied  by 
the  mean  rate  of  responding: 

    k = T\frac{dn}{dt} = T\omega p.     (30)

Equation  (29)  for  the  expected  change 
in  probability  per  response  is  still  valid 
if we now consider k to be a variable
as  expressed  by  equation  (30).  Thus, 
the  time  rate  of  change  of  p  is 

    \frac{dp}{dt} = \frac{a}{T}(1 - p) - \omega b p^2.     (31)

With a little effort, this differential
equation may be integrated from 0 to
t to give

    \frac{dn}{dt} = \frac{\omega}{z}\,\frac{(s - 1) + (s + 1)Ke^{-sat/T}}{1 - Ke^{-sat/T}},     (32)

where

    z = 2\omega Tb/a,     (33)

    s = \sqrt{1 + 2z},     (34)

    K = (1 + zp_0 - s)/(1 + zp_0 + s).     (35)

For arbitrarily large times t, the final
rate is

    \left(\frac{dn}{dt}\right)_{t=\infty} = \frac{\omega}{z}(s - 1).     (36)

For sufficiently large values of T,
z becomes large compared to unity
and we may write approximately

    \left(\frac{dn}{dt}\right)_{t=\infty} = \omega\sqrt{2/z} = \omega\sqrt{a/(b\omega T)}.     (37)

Thus,  for  large  values  of  T,  the  final 
rate  varies  inversely  as  the  square  root 
of  T. 
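
As a check on the integration, the closed form of equation (32) can be compared with a direct numerical solution of equation (31). The sketch below uses a simple Euler step, which is a choice made here for brevity and not a method from the text; the function names and the values p0 = 0.1, a = 0.1, b = 0.05, omega = 2.0, T = 5.0 are hypothetical.

```python
import math


def closed_form_rate(t, p0, a, b, omega, T):
    """Mean rate of responding from equations (32) to (35)."""
    z = 2.0 * omega * T * b / a
    s = math.sqrt(1.0 + 2.0 * z)
    K = (1.0 + z * p0 - s) / (1.0 + z * p0 + s)
    decay = K * math.exp(-s * a * t / T)
    return (omega / z) * ((s - 1.0) + (s + 1.0) * decay) / (1.0 - decay)


def euler_rate(t_end, p0, a, b, omega, T, steps=20000):
    """Integrate equation (31) by Euler steps and return omega * p(t_end)."""
    dt = t_end / steps
    p = p0
    for _ in range(steps):
        p += dt * ((a / T) * (1.0 - p) - omega * b * p * p)
    return omega * p


def final_rate(a, b, omega, T):
    """Asymptotic rate, equation (36)."""
    z = 2.0 * omega * T * b / a
    return (omega / z) * (math.sqrt(1.0 + 2.0 * z) - 1.0)


# Hypothetical parameter values.
P0, A, B, OMEGA, T = 0.1, 0.1, 0.05, 2.0, 5.0

print(closed_form_rate(50.0, P0, A, B, OMEGA, T), euler_rate(50.0, P0, A, B, OMEGA, T))
print(final_rate(A, B, OMEGA, T))
# Large-T approximation of equation (37) beside the exact final rate.
print(final_rate(A, B, OMEGA, 1e6), OMEGA * math.sqrt(A / (B * OMEGA * 1e6)))
```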

Periodic  reinforcement  is  a  spe- 
cial case  of  aperiodic  reinforcement 
in  which  the  set  of  time  intervals, 
T_1, ..., T_n, discussed above, consists
of  a  single  time  interval,  T.  Thus, 
all  the  above  equations  apply  to  both 
periodic  and  aperiodic  schedules.  One 
essential  difference  is  known,  however. 
In  the  periodic  case  the  animal  can 
learn  a  time  discrimination,  or  as  is 
sometimes  said,  eating  becomes  a  cue 
for  not  responding  for  a  while.  This 
seems  to  be  an  example  of  stimulus 
discrimination  which  we  will  discuss 
in  a  later  paper. 

Extinction  After  Partial  Reinforcement 
Schedules 

In  the  discussion  of  extinction  in 
earlier  sections,  it  may  be  noted  that 
the  equations  for  mean  rates  and 
cumulative  responses  depended  on  the 


previous  reward  training  only  through 
Ve,  the  mean  rate  at  the  beginning  of 
extinction.  Hence,  we  conclude  that 
equations  (18)  and  (19)  apply  to 
extinction  after  any  type  of  reinforce- 
ment schedule.  However,  the  quan- 
tities Ve  and  b  in  our  equations  may 
depend  very  much  on  the  previous 
training.  Indeed,  if  our  model  makes 
any  sense  at  all,  this  must  be  the  case, 
for  "resistance"  to  extinction  is  known 
to  be  much  greater  after  partial  rein- 
forcement training  than  after  a  con- 
tinuous reinforcement  schedule  (7). 

Since  the  rate  at  the  start  of  extinc- 
tion, Ve,  is  nearly  equal  to  the  rate  at 
the  end  of  acquisition,  it  will  certainly 
depend  on  the  type  and  amount  of 
previous  training.  However,  the  log- 
arithmic variation  in  equations  (19) 
and  (20)  is  so  slow,  it  seems  clear  that 
empirical  results  demand  a  dependence 
of  b  on  the  type  of  reinforcement 
schedule  which  preceded  extinction. 
We  have  argued  that  b  increases  with 
the  amount  of  work  required  per  re- 
sponse. We  will  now  try  to  indicate 
how  the  required  work  might  depend 
upon  the  type  of  reinforcement  sched- 
ule, even  though  the  lever  pressure  or 
key  tension  is  the  same.  For  con- 
tinuous reinforcement,  the  response 
pattern  which  is  learned  by  a  pigeon, 
for  example,  involves  pecking  the  key 
once,  lowering  its  head  to  the  food 
magazine,  eating,  raising  its  head,  and 
readjusting  its  body  in  preparation  for 
the  next  peck.  This  response  pattern 
demands  a  certain  amount  of  effort. 
On  the  other  hand,  the  response  pat- 
tern which  is  learned  for  other  types 
of  reinforcement  schedules  is  quite 
different;  the  bird  makes  several  key 
pecks  before  executing  the  rest  of 
the  pattern  just  described.  Thus  we 
would  expect  that  the  average  work 
required  per  key  peck  is  considerably 
less  than  for  continuous  reinforcement. 
This  would  imply  that  b  is  larger  and 
thus "resistance" to extinction is less
for continuous reinforcement than for
all  other  schedules.  This  deduction 
is  consistent  with  experimental  results 
(7).  However,  this  is  just  part  of  the 
story.  For  one  thing,  it  seems  clear 
that  it  is  easier  for  the  organism  to 
discriminate between continuous reinforcement
and extinction; we have not
handled  this  effect  here. 

Summary 

A  mathematical  model  for  simple 
learning  is  presented.  Changes  in  the 
probability  of  occurrence  of  a  response 
in  a  small  time  h  are  described  with 
the  aid  of  mathematical  operators. 
The  parameters  which  appear  in  the 
operator  equations  are  related  to  exper- 
imental variables  such  as  the  amount 
of  reward  and  work.  Relations  be- 
tween the  probability  and  empirical 
measures  of  rate  of  responding  and 
latent  time  are  defined.  Acquisition 
and  extinction  of  behavior  habits  are 
discussed  for  the  simple  runway  and 
for  the  Skinner  box.  Equations  of 
mean  latent  time  as  a  function  of  trial 
number  are  derived  for  the  runway 
problem;  equations  for  the  mean  rate 
of  responding  and  cumulative  numbers 
of  responses  versus  time  are  derived 
for  the  Skinner  box  experiments.  An 
attempt  is  made  to  analyze  the  learn- 
ing process  with  various  schedules  of 
partial  reinforcement  in  the  Skinner 
type  experiment.  Wherever  possible, 
the correspondence between the present
model and the work of Estes (2)
is pointed out.


A MODEL FOR STIMULUS GENERALIZATION
AND DISCRIMINATION

BY ROBERT R. BUSH AND FREDERICK MOSTELLER
Harvard University


Introduction 

The  processes  of  stimulus  generali- 
zation and  discrimination  seem  as  fun- 
damental to  behavior  theory  as  the 
simple  mechanisms  of  reinforcement 
and  extinction  are  to  learning  theory. 
Whether  or  not  this  distinction  be- 
tween learning  and  behavior  is  a  useful 
one,  there  can  be  little  doubt  that  few 
if  any  applications  of  behavior  theory 
to  practical  problems  can  be  made 
without  a  clear  exposition  of  the  phe- 
nomena of  generalization  and  discrimi- 
nation. It  is  our  impression  that  few 
crucial  experiments  in  this  area  have 
been  reported  compared  with  the  num- 
ber of  important  experiments  on  simple 
conditioning  and  extinction.  Perhaps 
part  of  the  reason  for  this  is  that  there 
are  too  few  theoretical  formulations 
available.  That  is  to  say,  we  con- 
ceive that  explicit  and  quantitative 
theoretical  structures  are  useful  in 
guiding  the  direction  of  experimental 
research  and  in  suggesting  the  type  of 
data  which  are  needed. 

In  this  paper  we  describe  a  model, 
based  upon  elementary  concepts  of 
mathematical  set  theory.  This  model 
provides  one  possible  framework  for 
analyzing problems in stimulus generalization
and discrimination. Further,
we shall show how this model
generates  the  basic  postulates  of  our 
previous  work  on  acquisition  and  ex- 
tinction (1),  where  the  stimulus  situa- 
tion as  defined  by  the  experimenter 
was  assumed  constant. 

Stated  in  the  simplest  terms,  gener- 
alization is  the  phenomenon  in  which 
an  increase  in  strength  of  a  response 
learned  in  one  stimulus  situation  im- 
plies an  increase  in  strength  of  response 
in  a  somewhat  different  stimulus  sit- 
uation. When  this  occurs,  the  two 
situations  are  said  to  be  similar.  Al- 
though there  are  several  intuitive 
notions  as  to  what  is  meant  by  "simi- 
larity," one  usually  means  the  proper- 
ties which  give  rise  to  generalization. 
We  see  no  alternative  to  using  the 
amount  of  generalization  as  an  opera- 
tional definition  of  degree  of  "simi- 
larity." In  the  model,  however,  we 
shall  give  another  definition  of  the 
degree  of  similarity,  but  this  definition 
will  be  entirely  consistent  with  the 
above-mentioned  operational  defini- 
tion. 

We  also  wish  to  clarify  what  we 
mean  by  stimulus  discrimination.  In 
one  sense  of  the  term,  all  learning  is  a 
process  of  discrimination.  Our  usage 
of  the  term  is  a  more  restricted  one, 
however.  We  refer  specifically  to  the 
process  by  which  an  animal  learns  to 
make  response  A  in  one  stimulus  situ- 
ation and  response  B  (or  response  A 
with  different  "strength")  in  a  differ- 
ent stimulus  situation.  We  are  not 
at  the  moment  concerned  with,  for 
example,  the  process  by  which  an 
animal learns to discriminate between
various possible responses in a fixed
stimulus  situation. 

As  prototypes  of  the  more  general 
problems  of  stimulus  generalization 
and  discrimination,  we  shall  consider 
the following two kinds of experiments:

(i)  An  animal  is  trained  to  make  a 
particular  response,  by  the  usual  rein- 
forcement procedure,  in  an  experimen- 
tally defined  stimulus  situation.  At  the 
end  of  training,  the  response  has  a  certain 
strength  or  probability  of  occurrence. 
The  animal  is  then  "tested"  in  a  new 
stimulus  situation  similar  to  the  training 
one  and  in  which  the  same  response, 
insofar  as  it  is  experimentally  defined, 
is  possible.  One  then  asks  about  the 
strength  or  probability  of  occurrence  of 
the  response  in  this  new  stimulus  situa- 
tion and  how  it  depends  on  the  degree  of 
similarity  of  the  new  situation  to  the  old 
stimulus  situation. 

(ii)  An  animal  is  presented  alternately 
with  two  stimulus  situations  which  are 
similar.  In  one,  an  experimentally  de- 
fined response  is  rewarded,  and  in  the 
other  that  response  is  either  not  re- 
warded or  rewarded  less  than  in  the  first. 
Through  the  process  of  generalization, 
the  effects  of  rewards  and  non-rewards 
in  one  stimulus  situation  influence  the 
response  strength  in  the  other,  but  even- 
tually the  animal  learns  to  respond  in 
one  but  not  in  the  other,  or  at  least  to 
respond  with  different  probabilities  (rates 
or  strengths).  One  then  asks  how  the 
probability  of  the  response  in  each  situa- 
tion varies  with  the  number  of  training 
trials,  with  the  degree  of  similarity  of  the 
two  situations,  and  with  the  amount  of 
reward. 

We  do  not  consider  that  these  two 
kinds  of  experiments  come  close  to  ex- 
hausting the  problems  classified  under 
the  heading  of  generalization  and  dis- 
crimination, but  we  do  believe  that 
they  are  fundamental.  Thus,  the 
model  to  be  described  has  been  de- 
signed to  permit  analysis  of  these 
experiments.  In  the  next  section  we 
will  present  the  major  features  of  the 
model,  and  in  later  sections  we  shall 


apply  it  to  the  above  described  ex- 
periments. 

The  Model 

We  shall  employ  some  of  the  ele- 
mentary notions  of  mathematical  set 
theory  to  define  our  model.     A  par- 
ticular stimulus  situation,  such  as  an 
experimental  box  with  specific  prop- 
erties (geometrical,  optical,  acoustical, 
etc.)  is  regarded  as  separate  and  dis- 
tinct from  the  rest  of  the  universe. 
Thus,  we  shall  denote  this  situation 
by  a  set  of  stimuli  which  is  part  of  the 
entire  universe  of  stimuli.     The  ele- 
ments of  this  set  are  undefined  and  we 
place  no  restriction  on  their  number. 
This  lack  of  definition  of  the  stimulus 
elements  does  not  give  rise  to  any 
serious  difficulties  since  our  final  results 
involve  neither  properties  of  individual 
elements  nor   numbers   of   such   ele- 
ments.    We  next  introduce  the  notion 
of  the  measure  of  a  set.     If  the  set  con- 
sists of  a  finite  number  of  elements,  we 
may  associate  with  each  element  a  posi- 
tive number  to  denote  its  "weight"; 
the  measure  of  such  a  set  is  the  sum  of 
all  these  numbers.      Intuitively,  the 
weight  associated  with  an  element  is 
the  measure  of  the  potential  impor- 
tance of  that  element  in  influencing 
the  organism's  behavior.     More  gen- 
erally, we  can  define  a  density  function 
over  the  set;  the  measure  is  the  inte- 
gral of  that  function  over  the  set. 

To  bridge  the  gap  between  stimuli 
and  responses,  we  shall  borrow  some 
of  the  basic  notions  of  Estes  (2). 
(The  concept  of  reinforcement  will 
play  an  integral  role,  however.)  It  is 
assumed  that  stimulus  elements  exist 
in  one  of  two  states  as  far  as  the 
organism  involved  is  concerned;  since 
the  elements  are  undefined,  these  states 
do  not  require  definition  but  merely 
need  labelling.  However,  we  shall 
speak  of  elements  which  are  in  one 
state as being "conditioned" to the
response, and of elements in the other
state  as  being  "non-conditioned." 

On  a  particular  trial  or  occurrence 
of  a  response  in  the  learning  process, 
it  is  conceived  that  an  organism  per- 
ceives a  sub-set  of  the  total  stimuli 
available.  It  is  postulated  that  the 
probability  of  occurrence  of  the  re- 
sponse in  a  given  time  interval  is  equal 
to  the  measure  of  the  elements  in  the 
sub-set  which  had  been  previously  con- 
ditioned, divided  by  the  measure  of 
the  entire  sub-set.  Speaking  roughly, 
the  probability  is  the  ratio  of  the  im- 
portance of  the  conditioned  elements 
perceived  to  the  importance  of  all  the 
elements  perceived.  It  is  further  as- 
sumed that  the  sub-set  perceived  is 
conditioned  to  the  response  if  that 
response  is  rewarded. 

The  situation  is  illustrated  in  Fig.  1. 
It  would  be  wrong  to  suppose  that  the 
conditioned  and  non-conditioned  ele- 
ments are  spatially  separated  in  the 
actual  situation  as  Fig.  1  might  sug- 
gest; the  conditioned  elements  are 
spread  out  smoothly  among  the  non- 
conditioned  ones.  In  set-theoretic 
notation, we then have for the probability
of occurrence of the response

    p = \frac{m(X \cap C)}{m(X)},     (1)

where m( ) denotes the measure of
any set or sub-set named between the
parentheses, and where X ∩ C indicates
the intersection of X and C (also
called set-product, meet, or overlap of
X and C). We then make an assumption
of equal proportions in the measures
so that

    p = \frac{m(X \cap C)}{m(X)} = \frac{m(C)}{m(S)}.     (2)


Fig. 1. Set diagram of the single stimulus
situation S with the various sub-sets involved
in a particular trial. C is the sub-set of
elements previously conditioned, X the sub-set
of S perceived on the trial. The sub-sets
A and B are defined in the text.

Heuristically, this assumption of
equal proportions can arise from a
fluid model. Suppose that the total
situation is represented by a vessel
containing an ideal fluid which is a
mixture of two substances which do
not chemically interact but are completely
miscible. For discussion let
the substances be water and alcohol
and assume, contrary to fact, that the
volume of the mixture is equal to the
sum of the partial volumes. The volume
of the water corresponds to the
measure  of  the  sub-set  of  non-condi- 
tioned stimuli,  S  —  C  (total  set  minus 
the  conditioned  set),  and  the  volume 
of  the  alcohol  corresponds  to  the 
measure  of  the  sub-set  C  of  condi- 
tioned stimuli.  The  sub-set  X  corre- 
sponds to  a  thimbleful  of  the  mixture 
and  of  course  if  the  fluids  are  well 
mixed,  the  volumetric  fraction  of  alco- 
hol in  a  thimbleful  will  be  much  the 
same  as  that  in  the  whole  vessel. 
Thus  the  fraction  of  measure  of  con- 
ditioned stimuli  in  X  will  be  equal  to 
the  fraction  in  the  whole  set  S,  as 
expressed  by  equation  (2).  Our  defi- 
nition of  p  is  essentially  that  of  Estes 
(2)  except  that  where  he  speaks  of 
number  of  elements,  we  speak  of  the 
measure  of  the  elements. 

We  next  consider  another  stimulus 
situation  which  we  denote  by  a  set  S'. 
In  general  this  new  set  S'  will  not  be 
disjunct from the set S, i.e., S and S'
will intersect or overlap as shown in
Fig. 2. We denote the intersection by

    I = S \cap S'.     (3)

We can now define an index of
similarity of S' to S by

    \eta(S' \text{ to } S) = \frac{m(I)}{m(S')}.     (4)


In words this definition says that the
index of similarity of S' to S is the
measure of their intersection divided
by the measure of the set S'. (Our
notation makes clear that we have
made a tacit assumption that the
measure of an element or set of elements
is independent of the set in
which it is measured.) Definition (4)
also gives the index of similarity of
S to S' as

    \eta(S \text{ to } S') = \frac{m(I)}{m(S)} = \frac{m(S')}{m(S)}\,\eta(S' \text{ to } S).     (5)

From this last equation it is clear that
the similarity of S' to S may not be
the same as the similarity of S to S'.
In fact, if the measure of the intersection
is not zero, the two indices are
equal only if the measures of S and S'
are equal. It seems regrettable that
similarity,  by  our  definition,  is  non- 
symmetric.  However,  we  do  not  care 
to  make  the  general  assumption  that 
(a)  the  measures  of  all  situations  are 
equal  and  at  the  same  time  make  the 
assumption that (b) the measure of an
element or set of elements is the same
in  each  situation  in  which  it  appears. 
For  then  the  importance  of  a  set  of 
elements,  say  a  light  bulb,  would  have 
to  be  the  same  in  a  small  situation, 
say  a  2'  X  2'  X  2'  box,  as  in  a  large 
situation,  say  a  ballroom.  Further 
this  pair  of  assumptions,  (a)  and  (b), 
leads  to  conceptual  difficulties. 
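
The asymmetry of the similarity index in equations (4) and (5) is easy to see with a small finite example. In the sketch below the stimulus elements are given arbitrary labels and weights; both are hypothetical, since the model leaves the elements undefined, and the function names are illustrative only.

```python
import math


def measure(weights, elements):
    """Measure of a finite set: the sum of the weights of its elements."""
    return sum(weights[e] for e in elements)


def similarity(weights, s_prime, s):
    """Index of similarity of s_prime to s, equation (4): m(I) / m(S')."""
    return measure(weights, s_prime & s) / measure(weights, s_prime)


# Hypothetical elements and weights; e5 belongs only to the second situation.
WEIGHTS = {"e1": 2.0, "e2": 1.0, "e3": 1.0, "e4": 3.0, "e5": 0.5}
S = {"e1", "e2", "e3", "e4"}
S_PRIME = {"e1", "e2", "e5"}

eta_sp_to_s = similarity(WEIGHTS, S_PRIME, S)
eta_s_to_sp = similarity(WEIGHTS, S, S_PRIME)

print("eta(S' to S):", eta_sp_to_s)
print("eta(S to S'):", eta_s_to_sp)
# Equation (5): the two indices differ by the factor m(S') / m(S).
print(math.isclose(eta_s_to_sp,
                   measure(WEIGHTS, S_PRIME) / measure(WEIGHTS, S) * eta_sp_to_s))
```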


The  Generalization  Problem 

We  are  now  in  a  position  to  say 
something  about  the  first  experimental 
problem  described  in  the  Introduction. 
An  animal  is  trained  to  make  a  re- 
sponse in  one  stimulus  situation  and 
then  his  response  strength  is  measured 
in  a  similar  situation.  After  the  ani- 
mal has  been  trained  in  the  first  situa- 
tion whose  elements  form  the  set  S,  a 
sub-set C of S will have been conditioned
to the response as shown in
Fig. 2. But part of the sub-set C is
also contained in the second situation
whose elements form the set S'; we
denote this part by C ∩ S'.

From the discussion preceding equations
(1) and (2), we can easily see
that the probability of the response
occurring in S' is

    p' = \frac{m(C \cap S')}{m(S')}.     (6)

We now use the assumption of equal
proportions so that

    \frac{m(C \cap S')}{m(I)} = \frac{m(C \cap I)}{m(I)} = \frac{m(C)}{m(S)}.     (7)


The first equality in this equation
follows from the fact that the only
part of C which is in S' is in the intersection
I as shown in Fig. 2. The
second equality in equation (7) is an
application of our assumption that the
measure of C is uniformly distributed
over S and so the intersection contains
the same fraction of measure of C as
does the entire set S.

Fig. 2. Diagram of two similar stimulus
situations after conditioning in one of them.
The situation in which training occurred is
denoted by the set S; the sub-set C of S
represents the portion of S which was conditioned
to the response. The new stimulus
situation in which the response strength is to
be measured is represented by the set S',
and the intersection of S' and S is denoted by I.

If now we combine equations (6)
and (7), we obtain

    p' = \frac{m(I)}{m(S')} \cdot \frac{m(C)}{m(S)}.     (8)

From equation (4) we note that the
first ratio in equation (8) is the index
of similarity of S' to S, while from
equation (2) we observe that the second
ratio in equation (8) is merely
the probability p of the response in S.
Hence

    p' = \eta(S' \text{ to } S)\,p.     (9)

Equation  (9)  now  provides  us  with 
the  necessary  operational  definition  of 
the  index  of  similarity,  7j(6"  to  S),  of 
the  set  S'  to  the  set  5.  The  proba- 
bilities p  and  p'  of  the  response  in  S 
and  S',  respectively,  can  be  measured 
either  directly  or  through  measure- 
ments of  latent  time  or  rate  of  re- 
sponding (1).  Therefore,  with  equa- 
tion (9),  we  have  an  operational  way 
of  determining  the  index  of  similarity. 

As a direct consequence of our assumption of equal proportions, we can
draw the following general conclusion. Any change made in a stimulus
situation where a response was conditioned will reduce the probability
of occurrence of that response, provided the change does not introduce
stimuli which had been previously conditioned to that response. This
conclusion follows from equation (9) and the fact that we have defined
our similarity index in such a way that it is never greater than unity.
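
The relation in equation (9) can be checked on a small finite example. The sketch below is illustrative only: the twelve unit-measure elements, the particular sets, and the helper names `measure` and `similarity_index` are assumptions introduced here and are not part of the model's statement. C is chosen so that the equal-proportions assumption of equation (7) holds, and p' computed directly from equation (6) is then compared with the product of the similarity index and p.

```python
from fractions import Fraction


def measure(elements, weight):
    """Total measure m(.) of a collection of stimulus elements."""
    return sum(weight[e] for e in elements)


def similarity_index(s_new, s_old, weight):
    """eta(s_new to s_old) = m(intersection) / m(s_new), the first ratio in equation (8)."""
    return Fraction(measure(s_new & s_old, weight), measure(s_new, weight))


# Hypothetical stimulus elements, each with unit measure (an assumption).
weight = {e: 1 for e in range(12)}
S = set(range(0, 8))          # situation in which training occurred
S_prime = set(range(4, 12))   # new situation; the intersection I is {4, 5, 6, 7}
C = {0, 1, 4, 5}              # conditioned sub-set, spread evenly over S
I = S & S_prime

# Equal proportions, equation (7): the conditioned fraction of I equals that of S.
frac_in_I = Fraction(measure(C & I, weight), measure(I, weight))
frac_in_S = Fraction(measure(C, weight), measure(S, weight))

p = frac_in_S                                                                # equation (2)
p_prime = Fraction(measure(C & S_prime, weight), measure(S_prime, weight))   # equation (6)
eta = similarity_index(S_prime, S, weight)                                   # index of S' to S
```

With these sets, `p_prime` and `eta * p` can be compared directly, and because the index never exceeds unity, `p_prime` cannot exceed `p`.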

A word needs to be said about the correspondence between our result and
the experimental results such as those of Hovland (3). Our model
predicts nothing about the relation of the index of similarity defined
above to such physical dimensions as light or sound intensity,
frequency, etc. In fact, our model suggests that no such general
relation is possible, i.e., that any sensible measure of similarity is
very much organism determined. Therefore, from the point of view of our
model, experiments such as those of Hovland serve only as a clear
demonstration that stimulus generalization exists. In addition, of
course, such experiments provide empirical relations, characteristic of
the organism studied, between the proposed index of similarity and
various physical dimensions, but these relations are outside the scope
of our model.

We conclude, therefore, that our model up to this point has made no
quantitative predictions about the shape of generalization gradients
which can be compared with experiment. Nevertheless, the preceding
analysis of generalization does provide us with a framework to discuss
experiments on stimulus discrimination. In the following sections we
shall extend our model so as to permit analysis of such experiments.

The Reinforcement and Extinction Operators

In this section we develop some results that will be used later and
show that the model of the present paper generates postulates used in
our previous paper (1). We shall examine the step-wise change in
probability of a response in a single stimulus situation S. We
generalize the notions already presented as follows: Previous to a
particular trial or occurrence of the response, a sub-set C of S will
have been conditioned. On the trial in question a sub-set X of S will
be perceived as shown in Fig. 1. According to our previous assumptions,
the probability of the response is

$$p = \frac{m(X \cap C)}{m(X)} = \frac{m(C)}{m(S)}. \tag{10}$$


We now assume that a sub-set A of X will be conditioned to the response
as a result of the reward given and that the measure of A will depend
on the amount of reward, on the strength of motivation, etc. We further
assume that another sub-set B of X will become non-conditioned as a
result of the work required in making the response. For simplicity we
assume that A and B are disjunct. (The error resulting from this last
assumption can be shown to be small if the measures of A and B are
small compared to that of S.)

We extend our assumption of equal proportions so that we have

$$\frac{m(A \cap C)}{m(A)} = \frac{m(B \cap C)}{m(B)} = \frac{m(C)}{m(S)}. \tag{11}$$


Now at the end of the trial being considered, sub-set A is part of the
new conditioned sub-set while sub-set B is part of the new
non-conditioned sub-set. Thus, the change in the measure of C is

$$\Delta m(C) = [m(A) - m(A \cap C)] - m(B \cap C) = m(A)(1 - p) - m(B)\,p. \tag{12}$$

This last form of writing equation (12) results from the equalities
given in equations (10) and (11). If we then let

$$a = \frac{m(A)}{m(S)}, \qquad b = \frac{m(B)}{m(S)}, \tag{13}$$

and divide equation (12) through by m(S), we have finally for the
change in probability:

$$\Delta p = \frac{\Delta m(C)}{m(S)} = a(1 - p) - bp. \tag{14}$$


We thus define a mathematical operator Q which when applied to p gives
a new value of probability Qp effective at the start of the next trial:

$$Qp = p + a(1 - p) - bp. \tag{15}$$

This operator is identical to the general operator postulated in our
model for acquisition and extinction in a fixed stimulus situation (1).
Hence, the set-theoretic model we have presented generates the basic
postulates of our previous model which we applied to other types of
learning problems (1). When the operator Q is applied n times to an
initial probability $p_0$, we obtain

$$Q^n p_0 = p_n = p_\infty - (p_\infty - p_0)\, g^n, \tag{16}$$

where $p_\infty = a/(a + b)$ and $g = 1 - a - b$. In the next section we
shall apply these results to the experiment on stimulus discrimination
described in the Introduction.
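
Equation (16) can be checked numerically by applying the operator of equation (15) repeatedly. The following is a minimal sketch; the function names are hypothetical, and the values of a, b, and the initial probability used in the usage note are only illustrative choices.

```python
def apply_q(p, a, b):
    """One application of the operator Q, equation (15)."""
    return p + a * (1.0 - p) - b * p


def iterate_q(p0, a, b, n):
    """Apply Q to p0 a total of n times, one trial at a time."""
    p = p0
    for _ in range(n):
        p = apply_q(p, a, b)
    return p


def closed_form(p0, a, b, n):
    """Right-hand side of equation (16)."""
    p_inf = a / (a + b)      # asymptotic probability
    g = 1.0 - a - b          # per-trial shrinkage of the distance to p_inf
    return p_inf - (p_inf - p0) * g ** n
```

For example, `iterate_q(0.05, 0.12, 0.03, 20)` and `closed_form(0.05, 0.12, 0.03, 20)` should agree to within rounding error, since Qp can be rewritten as a + gp.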

The  Discrimination  Problem 

Fig. 3. Set diagram for discrimination training in two similar stimulus
situations, S and S'. The various disjunct sub-sets are numbered. Set S
includes 1, 3, 5, and 6; S' includes 2, 4, 5, and 6. The intersection I
is denoted by 5 and 6. T, the complement of I in S, is shown by 1 and 3;
T', the complement of I in S', is shown by 2 and 4. C, the conditioned
sub-set in S, is represented by 3 and 6, while the conditioned sub-set
in S' is represented by 4 and 6. $T_c$ is denoted by 3, $T_c'$ by 4, and
$I_c$ by 6.

We are now in a position to treat the second experimental problem
described in the Introduction. An animal is presented alternately with
two stimulus situations S and S' which are similar, i.e., which have a
non-zero intersection. The rewards which follow occurrences of the
response are different for the two situations, and we are interested in
how the response strengths vary with training. At any point in the
process, sub-sets of S and S' will be conditioned to the response as
shown in Fig. 3. We shall distinguish between that part of S which is
also in S' and that part which is not by letting $I = S \cap S'$ and
$T = S - (S \cap S') = S - I$. We also distinguish between the part of
the conditioned sub-set C of S which is in I and that which is in T, by
letting $I_c = C \cap I$ and $T_c = C - (C \cap I) = T \cap C$. The
probability of the response in S is


$$p = \frac{m(C)}{m(S)} = \frac{m(T_c) + m(I_c)}{m(S)}. \tag{17}$$

Then we let

$$\alpha = \frac{m(T_c)}{m(T)}, \tag{18}$$

$$\beta = \frac{m(I_c)}{m(I)}, \tag{19}$$


and, abbreviating $\eta(S \text{ to } S')$ with $\eta$, we may write
(17) in the form

$$p = \alpha(1 - \eta) + \beta\eta. \tag{20}$$


We write the probability of the response in this form because we shall
soon argue that the index $\eta$ varies during discrimination training.
First, however, we shall investigate the variation of $\alpha$ and
$\beta$ with the number of training trials. From the definitions of
$\alpha$ and $\beta$, equations (18) and (19), we see that these
variables are very much like our probability p of equation (17) except
that they refer to sub-sets of S rather than to the entire set. By
strict analogy with the arguments in the last section, we conclude that

$$\alpha_n = Q^n \alpha_0 = \alpha_\infty - (\alpha_\infty - \alpha_0)\, g^n, \tag{21}$$

where $\alpha_\infty = a/(a + b)$ and $g = 1 - a - b$.
Now, $\beta$, the fraction of conditioned stimuli in the intersection I,
changes with each presentation of S' as well as of S. Thus, for each
presentation of S, we must operate on $\beta$ twice, once by our
operator Q which describes the effect of the environmental events in S,
and once by an analogous operator Q' which describes the effect of the
events in S'. Hence, it may be shown that

$$\beta_n = (Q'Q)^n \beta_0 = \beta_\infty - (\beta_\infty - \beta_0)\, f^n, \tag{22}$$

where

$$\beta_\infty = \frac{a' + a(1 - a' - b')}{a' + a(1 - a' - b') + b' + b(1 - a' - b')} \tag{23}$$

and where $f = (1 - a' - b')(1 - a - b)$.
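
The step from the two single operators to equations (22) and (23) can be verified numerically. The sketch below applies Q and then Q' on each cycle and compares the result with the closed form; the function names are hypothetical, and the parameter values in the usage note are illustrative assumptions.

```python
def apply_q(x, a, b):
    """Linear operator of equation (15): x -> x + a(1 - x) - b x."""
    return x + a * (1.0 - x) - b * x


def beta_limit(a, b, a_p, b_p):
    """Asymptote of beta under repeated application of Q'Q, equation (23)."""
    g_p = 1.0 - a_p - b_p
    return (a_p + a * g_p) / (a_p + a * g_p + b_p + b * g_p)


def beta_closed_form(beta0, a, b, a_p, b_p, n):
    """Right-hand side of equation (22)."""
    f = (1.0 - a_p - b_p) * (1.0 - a - b)
    b_inf = beta_limit(a, b, a_p, b_p)
    return b_inf - (b_inf - beta0) * f ** n


def beta_iterated(beta0, a, b, a_p, b_p, n):
    """Apply Q (events in S) and then Q' (events in S') on each of n cycles."""
    beta = beta0
    for _ in range(n):
        beta = apply_q(apply_q(beta, a, b), a_p, b_p)
    return beta
```

With a = 0.12, b = 0.03, a' = 0, and b' = 0.03, for instance, `beta_iterated(0.05, 0.12, 0.03, 0.0, 0.03, n)` should match `beta_closed_form` with the same arguments for every n, up to rounding.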

It should be stressed that we are assuming that the response occurs and
is rewarded to the same degree on every presentation of S. The same
statement, mutatis mutandis, applies to S'. Without this assumption, we
are not justified in applying the operators Q and Q' for each
presentation. The probability is then the probability that the response
will occur in an interval of time, h. The operational measure of this
probability is the mean latent time, which according to the response
model discussed earlier varies inversely as the probability (1).

We now have cleared the way for discussing the central feature of our
model for discrimination problems. We conceive that the measure of the
intersection I of the two sets S and S' decreases as discrimination
learning progresses. This concept seems to make sense intuitively since
the measure of any sub-set of stimuli indicates the importance of that
sub-set in influencing behavior. If an animal is rewarded for a
response in S but not rewarded for it in S', then the stimuli in I are
unreliable for deciding whether or not to make the response. And it is
just this ambiguity which causes the measure of the intersection to
decrease with training. We shall describe this change by introducing a
"discrimination operator," denoted by D, which operates on the
similarity index $\eta$ each time the environmental event following the
response changes from one type of event to another, e.g., from reward
to non-reward. In the present problem, we are considering alternate
presentations of S and S' and thus alternate occurrences of the events
associated with the operators Q and Q'. So if $\eta_i$ is the ratio of
the measure of I to that of S after the ith presentation of S, the
ratio after the (i + 1)th presentation is

$$\eta_{i+1} = D\eta_i. \tag{24}$$


Our next task is to postulate the form of the operator D.

We find that neither experimental data nor our intuition is of much
help in guiding our choice of such a postulate. For mathematical
simplicity we choose an operator which represents a linear
transformation on $\eta$. Moreover, we wish to have an operator which
always decreases $\eta$ (or holds it fixed), but which will never lead
to negative values of $\eta$. Therefore, we postulate that

$$D\eta = k\eta, \tag{25}$$

where k is a new parameter which is in the range between zero and 1. We
then have

$$\eta_n = D^n \eta_0 = k^n \eta_0. \tag{26}$$


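Combining equations (20), (21), (22), and (26), we have

$$\begin{aligned}
p_n &= Q^n\alpha_0\,(1 - D^n\eta_0) + (Q'Q)^n\beta_0\, D^n\eta_0 \\
    &= [\alpha_\infty - (\alpha_\infty - \alpha_0)\, g^n](1 - k^n\eta_0)
       + [\beta_\infty - (\beta_\infty - \beta_0)\, f^n]\, k^n\eta_0.
\end{aligned} \tag{27}$$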

This is our final expression for the variation of $p_n$, the
probability of the response in situation S, as a function of the trial
number n. This equation is composed of two major terms. The first term
corresponds to the relative measures of the stimulus elements of S
which are not in S' (the measure of $T_c$ divided by the measure of S).
The second term corresponds to the relative measure of the elements in
the intersection of S and S' (the measure of $I_c$ divided by the
measure of S).
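Because of the symmetry between S and S', we may write for the
probability in S':

$$p_n' = [\alpha_\infty' - (\alpha_\infty' - \alpha_0')\,(g')^n](1 - k^n\eta_0')
       + [\beta_\infty - (\beta_\infty - \beta_0)\, f^n]\, k^n\eta_0', \tag{28}$$

where $\alpha_\infty' = a'/(a' + b')$, and $g' = 1 - a' - b'$, and where
$\eta_0'$ is the initial value of

$$\eta' = \eta(S' \text{ to } S) = \frac{m(I)}{m(S')} = \frac{m(S)}{m(S')}\,\eta. \tag{29}$$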

We  shall  now  consider  some  special 
examples  for  which  certain  simplifying 
assumptions  can  be  made. 

(a) No conditioning before discrimination training. If no previous
conditioning took place in either S or S', it seems reasonable to
assume that the "operant" levels of performance in the two situations
are the same. Moreover, in view of our assumptions of equal
proportions, we may assume that initially:

$$\frac{m(C)}{m(S)} = \frac{m(T_c)}{m(T)} = \frac{m(I_c)}{m(I)}
  = \frac{m(T_c')}{m(T')} = \frac{m(C')}{m(S')}. \tag{30}$$


Hence, from equations (17), (18), and (19), we have
$p_0 = \alpha_0 = \alpha_0' = \beta_0$. Moreover, inspection of equation
(27) shows that, except when k = 1, we have $p_\infty = \alpha_\infty$,
and in like manner from equation (28) for $k \neq 1$, we have
$p_\infty' = \alpha_\infty'$. In Fig. 4 we have plotted equations (27)
and (28) with the above assumptions. The values a = 0.12, b = 0.03,
$p_0 = 0.05$, $\eta_0 = \eta_0' = 0.50$, k = 0.95 were chosen for these
calculations. As can be seen, the probability of the response in S is a
monotonically increasing, negatively accelerated function of the trial
number, while the probability in S' first increases due to
generalization, but then decreases to zero as the discrimination is
learned. These curves describe the general sort of result obtained by
Woodbury for auditory discrimination in dogs (4).

Fig. 4. Curves of probability, p (in S), and p' (in S'), versus trial
number, n, for discrimination training without previous conditioning.
It was assumed that the response was rewarded in S but not rewarded in
S'. Equation (27), equation (28), and the values $p_0 = p_0' = 0.05$,
a = 0.12, a' = 0, b = b' = 0.03, $\eta_0 = \eta_0' = 0.50$, and k = 0.95
were used.

Fig. 5. Reciprocals of probability, p, of the response in S, and p', of
the response in S', versus trial number, n, for discrimination training
without previous conditioning. In the model described earlier (1), mean
latent time is proportional to the reciprocal of probability. The
curves were plotted from the values of probability shown in Fig. 4.
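
The curves described for case (a) can be recomputed from equations (27) and (28). The sketch below is an illustration only: the function name `discrimination_curves` and its argument names are hypothetical, and the parameter values are the ones quoted for Fig. 4. It returns the two probability sequences so that the rise of p and the rise and later decline of p' can be inspected.

```python
def discrimination_curves(n_trials, a, b, a_p, b_p, p0, eta0, eta0_p, k):
    """Return lists (p_n, p_n') for n = 0..n_trials from equations (27) and (28).

    Assumes no conditioning before training, so alpha_0 = alpha_0' = beta_0 = p0.
    """
    g = 1.0 - a - b
    g_p = 1.0 - a_p - b_p
    f = g * g_p
    alpha_inf = a / (a + b)
    alpha_inf_p = a_p / (a_p + b_p)
    # Equation (23) for the asymptote of beta.
    beta_inf = (a_p + a * g_p) / (a_p + a * g_p + b_p + b * g_p)

    p, p_prime = [], []
    for n in range(n_trials + 1):
        alpha = alpha_inf - (alpha_inf - p0) * g ** n          # equation (21)
        alpha_p = alpha_inf_p - (alpha_inf_p - p0) * g_p ** n  # analogue for S'
        beta = beta_inf - (beta_inf - p0) * f ** n             # equation (22)
        eta = k ** n * eta0                                    # equation (26)
        eta_p = k ** n * eta0_p
        p.append(alpha * (1.0 - eta) + beta * eta)             # equation (27)
        p_prime.append(alpha_p * (1.0 - eta_p) + beta * eta_p) # equation (28)
    return p, p_prime


# Values quoted for Fig. 4.
p_curve, p_prime_curve = discrimination_curves(
    200, a=0.12, b=0.03, a_p=0.0, b_p=0.03,
    p0=0.05, eta0=0.50, eta0_p=0.50, k=0.95,
)
```

Taking reciprocals of the two lists gives quantities proportional to the mean latent times plotted in Fig. 5, under the response model cited as (1).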

We have argued (1) that the mean latent time varies inversely as the
probability. Thus in Fig. 5 we have plotted the reciprocals of $p_n$
and $p_n'$ given in Fig. 4. These curves exhibit the same general
property of the experimental curves on running time of rats obtained by
Raben (5).

(b) Complete conditioning in S before discrimination training. Another
special case of interest is that in which the set S is completely
conditioned to the response before the discrimination experiment is
performed. In this case, $\alpha_0 = \beta_0 = p_0 = p_\infty$. In
Fig. 6 we have plotted $p_n$ and $1/p_n$ with these conditions and the
values $p_\infty = 1$, $\beta_\infty = 0$, $\eta_0 = 0.80$, k = 0.90,
and f = 0.50. The curve of 1/p versus n is similar in shape to the
experimental latency curve obtained by Solomon (6) from a jumping
experiment with rats.

Fig. 6. Curves of probability, p, and its reciprocal versus trial
number, n, for the case of complete conditioning in S before the
discrimination training. Equation (27) with the values $p_\infty = 1$,
$\beta_\infty = 0$, $\eta_0 = 0.80$, k = 0.90, and f = 0.50 were used.
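
For case (b), equation (27) collapses to a single expression, which may make the shape of the curve easier to see. The working below is a sketch that takes $\alpha_\infty = p_\infty = 1$ (as follows from equation (27) when k < 1) together with the values quoted for Fig. 6; the numerical entries are computed here and are not read from the figure.

```latex
\begin{aligned}
p_n &= [\alpha_\infty - (\alpha_\infty - \alpha_0)\,g^n](1 - k^n\eta_0)
      + [\beta_\infty - (\beta_\infty - \beta_0)\,f^n]\,k^n\eta_0 \\
    &= (1 - k^n\eta_0) + f^n k^n \eta_0
       \qquad (\alpha_0 = \alpha_\infty = 1,\ \beta_0 = 1,\ \beta_\infty = 0) \\
    &= 1 - \eta_0\,k^n\,(1 - f^n). \\[4pt]
\text{With } \eta_0 &= 0.80,\ k = 0.90,\ f = 0.50: \\
p_0 &= 1, \quad
p_1 = 1 - (0.80)(0.90)(0.50) = 0.640, \quad
p_2 = 1 - (0.80)(0.81)(0.75) = 0.514, \\
p_3 &= 1 - (0.80)(0.729)(0.875) \approx 0.490, \quad
p_4 = 1 - (0.80)(0.6561)(0.9375) \approx 0.508 .
\end{aligned}
```

On these values the probability falls to a minimum near the third trial and then returns toward unity as $k^n$ goes to zero, so that 1/p first rises and then falls.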

(c) Limiting case of S and S' identical. Another limiting case of the
kind of discrimination experiment being considered here obtains when we
make the two stimulus situations S and S' identical. The problem
degenerates into one type of partial reinforcement where, for example,
an animal is rewarded on every second trial in a fixed stimulus
situation. The intersection I of S and S' is of course identical to
both S and S'. Thus the measure of I must equal the measure of S. From
equation (5), we have

$$\eta = \frac{m(I)}{m(S)} = 1, \tag{31}$$

while according to our postulate about the operator D, equation (26),
the similarity index varies from trial to trial:

$$\eta_n = k^n \eta_0. \tag{32}$$

For S and S' identical, the above two equations are incompatible,
unless we take k = 1. Thus, we are forced to assume that k depends on
how many cues are available for discrimination in such a way that k = 1
when none are available. Moreover, since I and S are identical, the
measure of T, the complement of I in S, must be zero. Since $T_c$ is a
sub-set of T, the measure of $T_c$ must also be zero. Therefore,
equations (17) and (19) give in place of equation (20)

$$p = \beta\eta. \tag{33}$$

But we have just argued that for S and S' identical, we have
$\eta = 1$. Thus

$$p = \beta. \tag{34}$$

Equation (22) gives us then

$$p_n = (Q'Q)^n p_0 = p_\infty - (p_\infty - p_0)\, f^n. \tag{35}$$

This equation agrees with our previous result on partial reinforcement
(1).

(d) Irregular presentations of S and S'. In most experiments, S and S'
are not presented alternately, but in an irregular sequence so that the
animal cannot learn to discriminate on the basis of temporal order. A
simple generalization of the above analysis will handle the problem.
The usual procedure is to select a block of (j + j') trials during
which S is presented j times and S' presented j' times. The actual
sequence is determined by drawing "S balls" and "S' balls" at random
from an urn containing j "S balls" and j' "S' balls." This sequence is
then repeated throughout training. In our model, we can describe the
effects on the probability of a known sequence by an appropriate
application of our operators Q, Q', and D for presentations of S,
presentations of S', and shifts from one to the other, respectively. A
less cumbersome method provides a reasonable approximation: for each
block of (j + j') trials we describe an effective or expected new value
of probability by applying Q to its operand j times, Q' to its operand
j' times, and D to the index $\eta$ a number of times determined by the
mean number of shifts from S to S'. For the special case of j = j', the
mean number of shifts is j. Since previously, we applied D to $\eta$
for each pair of shifts, we write for the (i + 1)th block of (2j) trials

$$p_{i+1} = Q^j\alpha_i\,(1 - D^{j/2}\eta_i) + (Q'Q)^j\beta_i\, D^{j/2}\eta_i. \tag{36}$$

The rest of the analysis exactly parallels that given above for the
case of alternate presentations of S and S'. The results will be
identical except for the value of k involved in the operator D.
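
The statement in case (d) that the mean number of shifts is j when j = j' can be checked by direct enumeration. The sketch below assumes that every arrangement of the block is equally likely and that a "shift" is any change of situation between consecutive trials within the block, in either direction; both are assumptions made here for illustration, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def mean_shifts(j, j_prime):
    """Exact mean number of changes of situation within a block of j + j' trials.

    Enumerates every placement of the j presentations of S among the
    j + j' trial positions and averages the count of adjacent unlike pairs.
    """
    n = j + j_prime
    total_shifts = 0
    n_sequences = 0
    for s_positions in combinations(range(n), j):
        in_s = set(s_positions)
        sequence = [position in in_s for position in range(n)]
        total_shifts += sum(1 for x, y in zip(sequence, sequence[1:]) if x != y)
        n_sequences += 1
    return Fraction(total_shifts, n_sequences)
```

For j = j' = 2 the six possible orders contain 1, 3, 2, 2, 3, and 1 shifts, so `mean_shifts(2, 2)` is 2. Shifts across the boundary between repeated blocks are not counted in this sketch.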

Summary 

A mathematical model for stimulus generalization and discrimination is
described in terms of simple set-theoretic concepts. An index of
similarity is defined in terms of the model but is related to
measurements in generalization experiments. The mathematical operators
for acquisition and extinction, discussed in an earlier paper (1), are
derived from the set-theoretic model presented here. The model is
finally applied to the analysis of experiments on stimulus
discrimination.

[MS. received October 13, 1950]

TWO-CHOICE BEHAVIOR OF PARADISE FISH

ROBERT R. BUSH AND THURLOW R. WILSON
Harvard University ¹

¹ This research was supported by the Laboratory of Social Relations,
Harvard University. We are indebted to W. S. Verplanck for suggesting
that we use fish in learning experiments and to F. Mosteller for
numerous suggestions and criticisms.

Our problem stems principally from two experiments. Brunswik (1)
observed the acquisition of a position discrimination by rats when food
was placed more frequently in one box. Research by Humphreys (9) was
comparable in that S had two choices with partial reinforcement of
both. He required college students to guess on every trial whether or
not a light would flash, and then in accordance with a predetermined
schedule, the light did or did not flash. The Humphreys study
exemplifies a noncontingent procedure for two-choice learning since the
flash of the light did not depend upon the choice made by S.
Brunswik's rats faced a contingent situation since the environmental
change, presentation of food, was contingent in part on S's response.
A contingent two-choice research on humans has been performed by
Goodnow (2, pp. 294-296). Her Ss decided on every trial which of two
buttons to press. If the choice was correct, they earned a poker chip,
otherwise not. Human two-choice learning with partial reinforcement has
been further observed under contingent procedure (3) and under
noncontingent procedure (3, 4, 5, 6, 7, 8, 10).

Bush and Mosteller (2) suggest that these two types of procedures are
associated with different forms of asymptotic choice distribution
(choice distribution after learning) for the individual Ss. In general,
most Ss in a contingent experiment are found to have an asymptotic
choice distribution of 100% selection of the favorable alternative.
Noncontingent situations give rise to other kinds of choice
distributions; in such experiments, the asymptotic proportion of
choices of the favorable alternative has been observed to match the
proportion of reinforcements scheduled for the alternative.²

We attempted to obtain the noncontingent results with nonhuman Ss. Red
paradise fish were confronted by a position discrimination with partial
reinforcement in which one side was correct a random 75% and the other
side correct for the remaining 25%. The apparatus was a discrimination
box with adjacent goal compartments. For the experimental Ss, E placed
the food in the correct compartment regardless of whether S had entered
the correct goal box; the division between the two goal boxes was
transparent for the experimental group so that these Ss were able to
see the food in the correct compartment when they had chosen
incorrectly. The control group was run with an opaque divider
separating the goal compartments in order to produce conditions
comparable to those used by Brunswik.

² Besides contingent and noncontingent procedure, other kinds of
factors, such as a gambling versus a problem-solving orientation, have
been related to asymptotic choice distribution (6). We shall not deal
with these other factors.

This article appeared in J. exp. Psychol., 1956, 51, 315-322. Reprinted with permission.

Theory

We attempt to describe the experimental data within the framework of
the stochastic model given by Bush and Mosteller (2). On trial n (where
n = 0, 1, 2, ...) there exists a probability $p_n$ that S will choose
the more favorable side. One of four events occurs on this trial and
each leads to a different value of $p_{n+1}$. As in similar analyses, we
assume that the effect of feeding is symmetrical for the two goal
boxes; we make a similar assumption for nonfeeding. In addition, we
assume that a long sequence of feedings on one side would tend to make
the probability of going there unity. These special assumptions reduce
the general model to the following statements about $p_{n+1}$.


Event                        $p_{n+1}$                                        Probability of occurrence

favorable side, food         $\alpha_1 p_n + (1 - \alpha_1)$                  $.75\,p_n$
favorable side, no food      $\alpha_2 p_n + (1 - \alpha_2)\lambda$           $.25\,p_n$
unfavorable side, food       $\alpha_1 p_n$                                   $.25(1 - p_n)$
unfavorable side, no food    $\alpha_2 p_n + (1 - \alpha_2)(1 - \lambda)$     $.75(1 - p_n)$

The model previously used for analyzing two-choice experiments using
the contingent procedure is obtained from the above table by imposing
the further restriction that $\alpha_2 = 1$. This assumption implies
that nonfeeding is an event which does not alter the response
probabilities. It was expected that this model would describe learning
by the control group in the present experiment. Given this specific
model, it can be shown that the asymptotic p for each S will be either
1.0 or 0; for the 75:25 schedule it is predicted that a high percentage
of Ss will tend towards 1.0. The exact percentage depends upon the
value of $\alpha_1$.
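
The four-event table can be written as a pair of small functions, which makes the effect of a restriction such as $\alpha_2 = 1$ easy to inspect. This is a sketch under stated assumptions: the function names, the string labels for the two sides, and the argument `pi` for the 75% schedule are hypothetical conveniences and do not come from the paper.

```python
def next_p(p, side, food, alpha1, alpha2, lam):
    """New probability of choosing the favorable side after one trial.

    side is "favorable" or "unfavorable"; food is True if S was fed.
    alpha1 governs feeding events, alpha2 and lam govern nonfeeding events.
    """
    if food:
        if side == "favorable":
            return alpha1 * p + (1.0 - alpha1)
        return alpha1 * p
    if side == "favorable":
        return alpha2 * p + (1.0 - alpha2) * lam
    return alpha2 * p + (1.0 - alpha2) * (1.0 - lam)


def event_probabilities(p, pi=0.75):
    """Probability of each of the four events when the favorable side is correct with probability pi."""
    return {
        ("favorable", True): pi * p,
        ("favorable", False): (1.0 - pi) * p,
        ("unfavorable", True): (1.0 - pi) * (1.0 - p),
        ("unfavorable", False): pi * (1.0 - p),
    }
```

With `alpha2 = 1.0`, both nonfeeding events return p unchanged whatever the value of `lam`, which is the sense in which nonfeeding does not alter the response probabilities in the model for the contingent procedure.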

We propose two specific models for the experimental group of the
present experiment. These models are obtained from the foregoing table
by imposing two different sets of additional restrictions which in turn
are suggested by two different theories of learning. The first specific
model, herein called the information model, is obtained by taking
$\alpha_1 = \alpha_2$ and $\lambda = 0$. As a result, the first and
fourth listed events in the foregoing table have the same effect on
$p_n$; they correspond to food being placed on the favorable side.
Similarly, the second and third listed events have the same effect;
they correspond to food being placed on the unfavorable side. These
restrictions appear to arise most readily from a cognitive learning
point of view, because each trial may be described as providing
information about the payoff schedules. This information model is
equivalent to the models used by Bush and Mosteller (2) and by Estes
(4) for describing human experiments with the noncontingent procedure.

The other specific model for the experimental group, herein called the
secondary reinforcement model, is obtained from the additional
restrictions, $\lambda = 1$ and $\alpha_2 > \alpha_1$. This model
assumes that when S enters one goal box and sees food in the other goal
box it is secondarily reinforced for the response just made. It has
been shown that this model predicts that each S will have an asymptotic
p of 1 or 0 and that more Ss will tend towards 1 than 0. The precise
proportion that tend towards 1 depends on the values of $\alpha_1$ and
$\alpha_2$.

We are chiefly concerned with predictions about the forms of the
asymptotic distributions of choices of the favorable side. These
predictions could be tested experimentally by running many trials in
the experiment and obtaining a proportion of choices for each S during,
say, the last 100 trials. The proportions thus obtained would form a
distribution which could be compared with the predicted ones.
Unfortunately, the mathematical analysis presented by Karlin (11)
suggests that the convergence of the distributions of these models is
very slow. Therefore, a great many trials would be required in the
experiment to obtain the desired distribution. In view of these
considerations, we are forced to examine the "near-asymptotic"
distributions. The information model predicts that such a distribution
will be clustered around a point just below .75, whereas the secondary
reinforcement model predicts that it will be U-shaped with a peak near
1 and a somewhat smaller peak near 0. The model for the control group
($\alpha_2 = 1$) also predicts a U-shaped near-asymptotic distribution,
but the peak near 0 should be very small compared to that for the
secondary reinforcement model. These predictions are compared with data
below.
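
A small Monte Carlo run gives a feel for what a near-asymptotic distribution looks like under the information model. The sketch below is an assumption-laden illustration, not the authors' procedure: it draws the correct side independently on each trial rather than by the restricted randomization used in the experiment, it records the final probability $p_n$ rather than an observed proportion of choices, and the values alpha = 0.9, an initial probability of 0.5, and 500 simulated subjects are arbitrary choices made here.

```python
import random


def simulate_subject(n_trials, alpha1, alpha2, lam, rng, p0=0.5, pi=0.75):
    """Return p after n_trials of the four-event model for one simulated S."""
    p = p0
    for _ in range(n_trials):
        chose_favorable = rng.random() < p
        favorable_correct = rng.random() < pi
        food = chose_favorable == favorable_correct
        if food:
            # Feeding: move toward the side just chosen.
            p = alpha1 * p + (1.0 - alpha1) if chose_favorable else alpha1 * p
        elif chose_favorable:
            p = alpha2 * p + (1.0 - alpha2) * lam
        else:
            p = alpha2 * p + (1.0 - alpha2) * (1.0 - lam)
    return p


rng = random.Random(0)  # fixed seed so the run is repeatable

# Information model: alpha1 = alpha2 and lambda = 0.
info_final = [simulate_subject(140, 0.9, 0.9, 0.0, rng) for _ in range(500)]
info_mean = sum(info_final) / len(info_final)

# Model for the control group: alpha2 = 1, so nonfeeding leaves p unchanged.
control_final = [simulate_subject(140, 0.9, 1.0, 0.0, rng) for _ in range(500)]
```

Under the information model the expected value of $p_{n+1}$ is $\alpha p_n + .75(1 - \alpha)$, so `info_mean` should lie close to .75, while a histogram of `control_final` can be inspected for the piling up near 0 and 1 that the text describes.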

Method 

Subjects. — The  Ss  were  49  red  paradise  fish,  27 
in  the  control  group  and  22  in  the  experimental 
group.  The  red  paradise  fish  (Macropodus 
opercularis)  is  a  hardy  tropical  fish  about  2  in. 
in  length  selected  because  of  its  small  demands 
for  care.  The  Ss  were  housed  separately  in 
tanks  with  a  water  temperature  of  80°  ±  1°F. 
This  was  the  temperature  indicated  by  our 
feeding  studies  for  maximum  appetite.  Lighting 
was  by  fluorescent  fixtures  which  were  auto- 
matically turned  on  for  a  standard  12-hr.  period 
each  day  to  control  the  activity  cycle.  (This 
fish  has  a  diurnal  rhythm  of  activity.) 

Apparatus. The apparatus was a discrimination box as shown in Fig. 1.
The maze was constructed of J-in. opaque white Plexiglas, except for
parts of the goal boxes. The control group had a white opaque divider,
whereas for the experimental group this divider was transparent. For
one goal box the side opposite the entrance to the box was formed from
a piece of opaque light yellow plastic; the corresponding side of the
other box was white opaque. These sides could be interchanged.
(Exploratory studies indicated that a position discrimination with
identical goal boxes is learned very slowly by these fish.)

The  apparatus  was  placed  in  a  10-gal.  tank 
shielded  from  room  lights.  Lighting  came 
largely  from  a  75-w.  spotlight  2  ft.  above  the 
maze  and  focused  on  the  start  chamber.  Care 
was  taken  to  ensure  that  water  conditions  of  this 
experimental  tank  were  as  close  as  possible  to 
those  of  the  home  tanks  of  Ss. 

Feeding. — The experimental food was prepared
fish eggs from an inexpensive (10 cents an
ounce)  caviar  ("Lumpfish  caviar"  packed  by 
Hansen  Caviar  Co.,  New  York,  N.  Y.).  These 
eggs  were  found  to  be  a  highly  preferred  food  of 
the  paradise  fish  and  were  convenient  to  obtain 
and  store.  The  eggs  were  presented  singly; 
the egg was held on the end of a medicine dropper
by suction (the egg was 1 mm. to 2 mm. in
diameter and larger than the opening of the
dropper). To secure the egg, the fish was
obliged to pull it from the dropper. A fish was
required to earn all of its food by solving the
discrimination problem.

Fig. 1. Sketch of the discrimination apparatus.

Pretraining. — The  pretraining  took  two  or 
three  days.  On  the  first  day  the  fish  was  fed 
eggs  (10  or  20)  by  eye  dropper  in  its  home  tank. 
For  the  next  one  or  two  days  the  fish  underwent 
forced  trials  (10  or  20)  in  the  maze.  Half  of  the 
forced  trials  were  to  the  right-side  goal  box. 
About  one-third  of  the  fish  were  rejected  from 
the  experiment  at  the  end  of  pretraining  or  after 
one or two days of discrimination training,
leaving 49 Ss. (Fish were rejected because they
would not eat in the apparatus or because E
made  an  error  in  procedure.) 

Procedure. — All  Ss  received  a  total  of  140 
trials,  20  trials  a  day  or  less.  One  goal  box  (the 
favorable  side)  was  scheduled  for  reinforcement 
on  75%  of  the  trials  while  the  other  goal  was 
scheduled  for  reinforcement  on  the  remaining 
trials.  On  a  given  trial  only  one  goal  box  was 
correct. The trials for which the favorable side
was incorrect were selected by restricted randomization
within blocks of 20. The restriction
was that runs of incorrect trials could not be longer
than two. All fish had the same schedule.
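
A schedule of this kind can be sketched as follows. The sketch assumes exactly five unfavorable-side trials in each block of 20 (25%) and applies the run restriction across block boundaries as well; the text states neither detail, and the function names and seed are hypothetical.

```python
import random

def longest_false_run(seq):
    """Length of the longest run of trials on which the favorable side is incorrect."""
    longest = current = 0
    for favorable_correct in seq:
        current = 0 if favorable_correct else current + 1
        longest = max(longest, current)
    return longest

def restricted_schedule(n_blocks=7, block_size=20, n_incorrect=5,
                        max_run=2, seed=0):
    """Return a list of booleans, True where the favorable side is correct.

    Assumptions (not stated in the source): exactly n_incorrect unfavorable
    trials per block, and the run limit also holds across block boundaries.
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_blocks):
        while True:
            # Shuffle one block and keep it only if the run restriction holds.
            block = [False] * n_incorrect + [True] * (block_size - n_incorrect)
            rng.shuffle(block)
            if longest_false_run(schedule + block) <= max_run:
                schedule.extend(block)
                break
    return schedule

schedule = restricted_schedule()  # 7 blocks of 20 = 140 trials
```

Rejection sampling is used here only because it is the simplest way to honor the restriction; any method that yields the same set of admissible blocks would serve.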

The  right,  yellow  side  was  favorable  for  about 
one-fourth  of  the  Ss;  right,  white  for  one-fourth; 
left  and  yellow  for  one-fourth,  and  left  and  white 
for  one-fourth. 

The  procedure  for  the  control  group  was  as 
follows.  The  fish  was  released  from  the  start 
chamber,  and  it  swam  down  to  the  goal  boxes. 
If the fish poked its nose into the goal box which
was correct for that trial, E lowered a medicine
dropper with a fish egg into the compartment
(the dropper was secured to an arm) to allow
the fish to feed. If the fish entered the incorrect
goal box, no food was placed in the goal box. In
either case, the fish was chased back into the
start chamber after 3-4 sec. in the goal box.
This  was  accomplished  with  a  piece  of  plastic 
of  width  slightly  less  than  the  width  of  the  maze; 
the  fish  quickly  developed  avoidance  tendencies 
to  this  "paddle."  As  soon  as  it  was  lowered  into 
the  tank,  the  fish  promptly  returned  to  the  start 
chamber.  The  interval  between  trials  was  12 
sec.  No  retracing  was  permitted.  Except  for 
the transparent rather than opaque piece dividing
the two goal boxes, the procedure for the
experimental  group  differed  only  in  one  detail: 
after  the  fish  had  entered  a  compartment  E 
placed  the  medicine  dropper  with  a  fish  egg  in 
the  correct  goal  box.  If  the  fish  had  entered  the 
correct  goal  box,  it  secured  the  egg.  Otherwise 
the  fish  could  see  the  egg  through  the  transparent 
divider  but  could  not  obtain  it.  Observations 
indicated  that  they  did  in  fact  see  the  egg  on 
most  of  these  trials. 

Results  and  Discussion 

Initial  preferences. — Position  and 
color preferences may strongly influence the results of a
discrimination study. For this reason the balanced design described
in the preceding section was used. This technique, however, tends to
eliminate a group preference only. From an analysis of variance of
the responses on the first 10 trials, we concluded that there were no
group color or position preferences but that there were individual
preferences. The stochastic models used in analyzing the data are
sensitive to the entire distribution of initial probabilities, not
only its mean. Therefore, it is necessary to consider the actual
distribution.

TABLE 1

Observed Distribution of Choices of the Favorable Side During the
First Ten Trials for the Two Groups of Fish Combined, and the
Theoretical Distributions for the Binomial Model (P = .5) and for
the Symmetric Beta Distribution with s = .7

Number      Observed           Predicted
Choices     Number Fish     Binomial     Beta

   0             2             0.05      2.34
   1             3             0.48      3.71
   2             5             2.16      4.66
   3             7             5.75      5.28
   4             4            10.06      5.64
   5             8            12.07      5.76
   6             8            10.06      5.64
   7             1             5.75      5.28
   8             5             2.16      4.66
   9             2             0.48      3.71
  10             4             0.05      2.34

 Total          49            49.07     49.02

One  binomial  observation  for  each 
initial  probability  is  insufficient  to 
determine  anything  about  the  initial 
distribution  except  the  mean.  Thus 
we must look at the number of successes
(choices of the favorable side)
by  each  fish  during  the  first  several 
trials  and  assume  that  the  probability 
for  each  fish  does  not  appreciably 
change  during  these  trials.  For  this 
purpose  the  two  groups  of  Ss  were 
combined,  giving  an  N  of  49,  and  the 
first  10  trials  of  the  data  were  used. 
In  Table  1  we  show  the  frequencies 
of  choice  observed  as  well  as  those 
predicted  by  two  models  which  are 
now  briefly  discussed. 

The mean proportion of observed successes during the first 10 trials
is .496 and so the balanced design accomplished its purpose. But, if
we assume that each of the 49 fish had a binomial probability of .5,
the predicted frequencies of choices are those shown in the third
column of Table 1. The discrepancies are highly significant. The
likelihood ratio test (12, p. 257) (this is essentially the
chi-square test) leads to P < .005.
Therefore we consider an alternative assumption: that the initial
distribution is a symmetric beta distribution (12, p. 115) with a
mean of .5. It may be written in the form

f(p) = C[p(1 - p)]^s,

where C is a constant chosen so that the total density is unity, and
where s is a parameter which determines the spread of the
distribution. The method of maximum likelihood (12, pp. 152-160) was
used to estimate s from the data, giving .7 as the estimate. The
distribution of successes during 10 trials can then be computed.
The results are shown in the last column of Table 1 and the
likelihood ratio test gives P = .4. This fit was considered
satisfactory.

Fig. 2. Learning curve for each of the two groups of fish and for
the 22 stat-fish which parallel the experimental group. Mean
proportion of choices of the favorable side is plotted for each
block of 10 trials.
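
The two predicted columns of Table 1 can be recomputed with a short sketch. It assumes the symmetric beta density with exponent s = .7 corresponds to a Beta(s + 1, s + 1) distribution, so that the number of successes in 10 trials follows a beta-binomial law; the function names are hypothetical, and the observed counts are copied from Table 1.

```python
from math import comb, exp, lgamma

N_FISH, N_TRIALS = 49, 10
observed = [2, 3, 5, 7, 4, 8, 8, 1, 5, 2, 4]  # Table 1, choices 0..10

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def binomial_expected(p=0.5):
    """Expected number of fish at each count if every fish has probability p."""
    return [N_FISH * comb(N_TRIALS, k) * p**k * (1 - p)**(N_TRIALS - k)
            for k in range(N_TRIALS + 1)]

def beta_binomial_expected(s=0.7):
    """Expected counts when p varies over fish with density C[p(1-p)]^s."""
    a = s + 1.0  # assumption: exponent s means Beta(s + 1, s + 1)
    return [N_FISH * comb(N_TRIALS, k)
            * exp(log_beta(k + a, N_TRIALS - k + a) - log_beta(a, a))
            for k in range(N_TRIALS + 1)]

# Mean proportion of favorable choices over all fish and the first 10 trials.
mean_proportion = (sum(k * n for k, n in enumerate(observed))
                   / (N_FISH * N_TRIALS))
binom = binomial_expected()
beta = beta_binomial_expected()
```

Under these assumptions the computed values agree with the tabled ones to within rounding, which supports reading the exponent of the density as s itself.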

Learning  curves. — In  Fig.  2  we 
show  the  proportion  of  successes  in 
blocks  of  10  trials  for  each  of  the  two 
groups  of  fish.  It  is  clear  that  the 
control  group  learned  more  rapidly 
than  the  experimental  group,  but 
little  more  can  be  inferred  from  this 
figure.  One  can  conjecture,  of  course, 
that  the  sight  of  food  in  the  opposite 
goal  box  when  food  was  not  obtained 
slowed  down  the  learning  process. 
Just  how  this  comes  about  can  be 
determined  only  by  a  more  detailed 
analysis  of  the  data. 

We hasten to note at this point that the models described above do
not predict the relative rates of learning of groups of Ss run under
different experimental conditions. Within the framework of the
models, rates of learning are determined by the values of parameters
which must be estimated from data. The models do predict, however,
other properties of the data considered in the following sections.

The  near-asymptotic  distributions. — 
The two specific models for the experimental group, the information
model and the secondary reinforcement model, make very different
predictions about the shape of the distributions of successes after
learning is nearly complete. In the second column of Table 2 we show
the frequencies of successes during the last 49 trials (the number of
successes varies from 0 through 49). The observed U-shaped
near-asymptotic distribution is not determined by initial preferences
alone; the rank-order correlation coefficient between the number of
favorable choices on the first and last 10 trials is .22. The
information model predicts a clustering around 37 but this prediction
is clearly not confirmed by the experimental group data. The
secondary reinforcement model, on the other hand, predicts a U-shaped
distribution with greater density at the high end than at the low
end. This prediction is confirmed. On this basis alone we can choose
the secondary reinforcement model in preference to the information
model. Detailed questions of goodness of fit are considered in the
following sections.

TABLE 2

Distribution of Successes (Choices of the Favorable Side) During the
Last 49 Trials for the Two Groups of Fish and for the 22 Stat-Fish
Which Parallel the Experimental Group of Real Fish

Number        Experimental                 Control
Successes     Group          Stat-Fish     Group

  0-4             4              4            1
  5-9             1              2            0
 10-14            2              0            0
 15-19            0              0            1
 20-24            0              0            2
 25-29            0              1            0
 30-34            1              0            2
 35-39            2              2            3
 40-44            2              3            7
 45-49           10             10           11

 Total           22             22           27

The model proposed for the control group involves the assumption
that nonreward has no effect (a2 = 1) and it predicts that the
near-asymptotic distribution of successes will also be U-shaped but
with very small density at the low end. This indeed agrees with the
data shown in the last column of Table 2; one out of 27 fish
stabilized at the unfavorable side, choosing that side 46 times
during the last 49 trials. The other 26 fish either stabilized on the
favorable side or had not yet stabilized during the trials run. In
the next section we consider the basic assumption that a2 = 1 made in
the model for the control group.

Parameter  estimates. — Having  chosen 
the secondary reinforcement model for the experimental group, we
need to estimate the primary reward parameter, a1, and the secondary
reward parameter, a2. These estimates are required for two reasons:
(a) we wish to measure the relative effects of primary and secondary
reinforcement in this experiment (the smaller the value of a, the
greater the effect), and (b) the estimates are used in measuring
goodness of fit of the model to the data in a detailed way. For the
control group, we assume that the same model applies and then
estimate both parameters and determine whether or not the assumption
that a2 = 1 is tenable.

TABLE 3

Estimates of the Two Parameters Obtained for Each of the Two Groups
of Fish

Parameters                 Experimental     Control
                           Group            Group

Primary reward, a1            0.916          0.956
Secondary reward, a2          0.942          0.986

The procedure used to estimate the two reward parameters cannot be
described in detail here. It uses the first three moments of the
observed distributions of successes in each block of 10 trials;
these are used in conjunction with formulas for moments of the
p-value distributions derived by Bush and Mosteller (2, p. 98). The
results, however, are shown in Table 3. It can be noted that the
secondary reward parameter, a2, is larger for both groups than the
corresponding primary reward parameter, a1. This confirms the
expectation that primary reward is more effective. (A small value of
a implies a more effective event than does a large value.) For the
control group, the value of a2 is near 1.0 as assumed in the model
for the control group, but the fact that it is not quite 1.0 suggests
that nonreward is slightly reinforcing even for the control group.
The result that a1 is less for the experimental group than for the
control group (primary reward more effective) is not predicted by any
of the models.

The relative effects of primary and secondary reward for each group
can be estimated as follows. We note that (.915)^.69 = .942 and this
means that secondary reward is about 60% as effective as primary
reward for the experimental group. Similarly, (.956)^.31 = .986, and
so secondary reward is about 30% as effective for the control group.
These percentages may be in error appreciably because of the sampling
errors in the parameter estimates, but they do indicate roughly the
effects.
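
The comparison amounts to finding the exponent r for which a1 raised to the power r equals a2, that is, r = log(a2) / log(a1). A minimal sketch, using the estimates as printed in Table 3 (the function name is hypothetical):

```python
from math import log

def relative_effectiveness(a_primary, a_secondary):
    """Exponent r such that a_primary ** r equals a_secondary.

    One application of the secondary-reward operator then changes the
    response probability about as much as r applications of the
    primary-reward operator.
    """
    return log(a_secondary) / log(a_primary)

r_experimental = relative_effectiveness(0.916, 0.942)  # Table 3, experimental
r_control = relative_effectiveness(0.956, 0.986)       # Table 3, control
```

With the Table 3 values the exponents come out near 0.68 for the experimental group and 0.31 for the control group; small changes in the third decimal of either estimate shift these figures by several percentage points, which is in line with the caution about sampling error.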

Stat-fish. — A convenient way of comparing model predictions with
data is to run Monte Carlo computations or "stat-fish" as described
elsewhere (2, pp. 129-131, 251-252). One hundred runs of 140 trials
each were carried out on IBM machines¹ using the parameter values
given in Table 3 for the experimental group. From these 100 runs, a
stratified sample of 22 runs was drawn such that the initial
distribution of probabilities would approximate the symmetric beta
distribution with the parameter s = .7. These 22 stat-fish can then
be compared directly with the 22 paradise fish in the experimental
group.
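
A Monte Carlo run of this kind might be sketched as below. The operator equations are not given in this section, so the sketch assumes linear operators of the Bush and Mosteller type in which every outcome strengthens the response just made: a1 applies when the fish obtains food and a2 when it only sees food on the other side. Initial probabilities are drawn directly from a Beta(1.7, 1.7) distribution rather than by stratified sampling from 100 runs, and the reward schedule is a simple 75:25 random sequence without the run restriction. All names and the seed are hypothetical.

```python
import random

def run_stat_fish(p0, schedule, a1=0.916, a2=0.942, rng=random):
    """Simulate one stat-fish; return its choices (True = favorable side).

    p is the probability of choosing the favorable side. Assumed operators:
    p -> a*p + (1 - a) after a favorable-side choice, p -> a*p after an
    unfavorable-side choice, with a = a1 if food was obtained and a = a2
    if food was only seen in the other goal box.
    """
    p = p0
    choices = []
    for favorable_correct in schedule:
        chose_favorable = rng.random() < p
        rewarded = chose_favorable == favorable_correct
        a = a1 if rewarded else a2
        p = a * p + (1 - a) if chose_favorable else a * p
        choices.append(chose_favorable)
    return choices

rng = random.Random(1)
schedule = [rng.random() < 0.75 for _ in range(140)]       # simplified schedule
initial_ps = [rng.betavariate(1.7, 1.7) for _ in range(22)]
stat_fish = [run_stat_fish(p0, schedule, rng=rng) for p0 in initial_ps]
successes = [sum(fish) for fish in stat_fish]              # favorable choices per fish
```

Because both operators push the probability toward whichever side was just chosen, individual stat-fish in this sketch tend to drift toward one side or the other, which is the qualitative behavior the secondary reinforcement model is said to predict.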

The "learning curve" of the stat-fish is shown in Fig. 2 along with
those of the real fish. It can be seen that the stat-fish curve is
slightly above the curve for the experimental group. This should not
be interpreted as a discrepancy between the model and the data.
Rather it is some indication of how well the model parameters were
estimated from the data. Loosely speaking, the estimates were
obtained by requiring that the learning rates of the model population
and of the experimental sample be equal. To measure goodness of fit
we must look at other properties of the data.

The near-asymptotic distribution of successes of the 22 stat-fish
was obtained in the same manner as for the real fish. The results are
shown in the third column of Table 2 and are sufficiently close to
the corresponding frequencies of the experimental group that we
consider formal tests for goodness of fit would be superfluous.

Many  sequential  properties  of  the  data 
can  be  compared  to  the  corresponding 
properties  of  the  stat-fish  "data"  in 
order  to  obtain  further  measures  of 
goodness  of  fit.  Thus  we  have  tabulated 
the  distribution  of  runs  (of  successes  and 
failures)  for  the  experimental  group  and 
for  the  stat-fish.  In  Table  4  we  show 
the  mean  and  SD  of  the  total  number  of 
runs,  of  the  number  of  runs  of  various 
lengths, as well as the number of successes
per S. It can be seen that all
but  one  of  the  tabulated  means  are 
slightly  smaller  for  the  real  fish  than  for 
the  stat-fish,  and  that  the  variability  of 
these  measures  is  less  for  the  real  fish. 

¹ We are indebted to B. P. Cohen and P. D. Seymour for making these
computations.

TABLE 4

Comparison of Statistics Computed from the Data for the Experimental
Group of 22 Fish and from the Sequences Obtained from the 22
Stat-Fish

                        Experimental Group       Stat-Fish
Statistic                 Mean       SD        Mean       SD

Total number runs         27.3      13.7       29.8      20.2
Runs of length 1          12.9       6.7       14.2      10.1
Runs of length 2           4.4       3.8        5.3       6.3
Runs of length 3           2.0       1.8        3.0       2.6
Runs of length 4           1.5       1.6        2.2       2.2
Runs of length 5           1.1       1.3        1.0       1.4
Number successes          81.3      48.G       87.6      48.2

All these discrepancies are a result of the fact that two of the
stat-fish never chose the unfavorable side and two others chose it
only once each. These four stat-fish had initial success
probabilities of .95, .95, .85, and .85, respectively. The smallest
number of failures by the real fish is five. This suggests that
better agreement would have been found if the initial distribution of
probabilities had had less density in the extremes; the symmetric
beta distribution was used only as an approximation to the true
initial distribution. Furthermore, learning during the first 10
trials tends to spread out the distribution of response probabilities
and so the true initial distribution probably had less variance than
the symmetric beta distribution used in the stat-fish computations.
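
The run statistics of Table 4 can be obtained from each subject's trial sequence along the following lines. The sketch assumes a run is a maximal stretch of consecutive identical outcomes, successes and failures alike; the function name and the example sequence are hypothetical.

```python
from collections import Counter
from itertools import groupby

def run_statistics(choices, max_length=5):
    """Summarize runs of identical outcomes in one subject's trial sequence."""
    # groupby splits the sequence into maximal stretches of equal values.
    lengths = [len(list(group)) for _, group in groupby(choices)]
    by_length = Counter(lengths)
    return {
        "total_runs": len(lengths),
        "runs_by_length": {n: by_length.get(n, 0)
                           for n in range(1, max_length + 1)},
        "successes": sum(1 for c in choices if c),
    }

# Hypothetical 10-trial sequence: 1 = favorable side chosen, 0 = unfavorable.
example = run_statistics([1, 1, 0, 1, 0, 0, 0, 1, 1, 1])
```

Applying the function to every fish and taking the mean and SD of each entry across the 22 subjects gives statistics of the kind tabulated.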

The distributions of the statistics given in Table 4 for the real
fish and stat-fish can be compared in the same manner as used to
compare two groups of Ss. The distributions are not normal and so we
used the Mann-Whitney test (13). Comparison of each of the seven
statistics listed in Table 4 led to P values greater than .3. Thus,
we conclude that the model adequately describes much of the
fine-grain character of the data.
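
For readers who wish to repeat such a comparison, the Mann-Whitney statistic can be computed directly from two samples. The sketch below obtains the P value from the large-sample normal approximation without a correction for ties; that choice is an assumption made here and may differ from the tables or formulas used with reference (13).

```python
from math import erfc, sqrt

def mann_whitney_u(x, y):
    """Return (U_x, U_y); a tie between an x and a y counts one half."""
    u_x = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    return u_x, len(x) * len(y) - u_x

def two_sided_p(x, y):
    """Two-sided P value from the normal approximation (no tie correction)."""
    m, n = len(x), len(y)
    u_x, _ = mann_whitney_u(x, y)
    z = (u_x - m * n / 2) / sqrt(m * n * (m + n + 1) / 12)
    return erfc(abs(z) / sqrt(2))
```

Here x and y would hold one statistic, such as the total number of runs, for each of the 22 real fish and each of the 22 stat-fish.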
Summary

A two-choice experiment designed to provide Ss with complete
information about the outcomes of each choice on each trial is
described. The Ss were 49 red paradise fish divided into two groups;
the control Ss were run with the conventional procedure whereas the
experimental Ss were given an opportunity to observe the presence or
absence of food on both sides of the maze. Both groups were rewarded
on one side 75% of the time and on the other side the remaining 25%
of the time.

Two stochastic models for predicting the behavior of the
experimental group are discussed. The "information model" assumes an
increment in the probability of a fish choosing on a particular trial
the side on which food was placed on the preceding trial. This model
predicts that the distribution of choices approaches about .75 for
all fish. The "secondary reinforcement model," on the other hand,
assumes that sight of food in the opposite goal box reinforces the
response just made and predicts that individual fish will approach
100% choice of one side or the other.

The data obtained support the secondary reinforcement model.
Parameters which measure the effectiveness of primary and secondary
reward are estimated from the data and then detailed comparisons
between model predictions and experimental results are made. It is
concluded that the secondary reinforcement model adequately describes
much of the fine-grain structure of the data.
TOWARD A STATISTICAL THEORY OF LEARNING *


BY  WILLIAM  K.  ESTES 
Indiana  University 


Improved experimental techniques for the study of conditioning and
simple discrimination learning enable the present day investigator to
obtain data which are sufficiently orderly and reproducible to
support exact quantitative predictions of behavior. Analogy with
other sciences suggests that full utilization of these techniques in
the analysis of learning processes will depend to some extent upon a
comparable refinement of theoretical concepts and methods. The
necessary interplay between theory and experiment has been hindered,
however, by the fact that none of the many current theories of
learning commands general agreement among researchers. It seems
likely that progress toward a common frame of reference will be slow
so long as most theories are built around verbally defined
hypothetical constructs which are not susceptible to unequivocal
verification. While awaiting resolution of the many apparent
disparities among competing theories, it may be advantageous to
systematize well established empirical relationships at a peripheral,
statistical level of analysis. The possibility of agreement on a
theoretical framework, at least in certain intensively studied areas,
may be maximized by defining concepts in terms of experimentally
manipulable variables, and developing the consequences of assumptions
by strict mathematical reasoning.

This essay will introduce a series of studies developing a
statistical theory of elementary learning processes. From the
definitions and assumptions which appear necessary for this kind of
formulation, we shall attempt to derive relations among commonly used
measures of behavior and quantitative expressions describing various
simple learning phenomena.

* For continual reinforcement of his efforts at theory construction,
as well as for many specific criticisms and suggestions, the writer
is indebted to his colleagues at Indiana University, especially
Cletus J. Burke, Douglas G. Ellson, Norman Guttman, and William S.
Verplanck.

This article appeared in Psychol. Rev., 1950, 57, 94-107. Reprinted
with permission.

Preliminary Considerations

Since propositions concerning psychological events are verifiable
only to the extent that they are reducible to predictions of behavior
under specified environmental conditions, it appears likely that
greatest economy and consistency in theoretical structure will result
from the statement of all fundamental laws in the form

R = f(S),

where R and S represent behavioral and environmental variables
respectively. Response-inferred laws, as for example those of
differential psychology, should be derivable from relationships of
this form. The reasoning underlying this position has been developed
in a recent paper by Spence (8). Although developed within this
general framework, the present formulation departs to some extent
from traditional definitions of S and R variables.

Many apparent differences among contemporary learning theories seem
to be due in part to an oversimplified definition of stimulus and
response. The view of stimulus and response as elementary,
reproducible units has always had considerable appeal because of its
simplicity. This simplicity is deceptive, however, since it entails
the postulation of various hypothetical processes to account for
observed variability in behavior. In the present formulation, we
shall follow the alternative approach of including the notion of
variability in the definitions of stimulus and response, and
investigating the theoretical consequences of these definitions.

It will also be necessary to modify the traditional practice of
stating laws of learning in terms of relations between isolated
stimuli and responses. Attempts at a quantitative description of
learning and extinction of operant behavior have led the writer to
believe that a self-consistent theory based upon the classical S-R
model may be difficult, if not impossible, to extend over any very
wide range of learning phenomena without the continual addition of ad
hoc hypotheses to handle every new situation. A recurrent difficulty
might be described as follows. In most formulations of simple
learning, the organism is said originally to "do nothing" in the
presence of some stimulus; during learning, the organism comes to
make some predesignated response in the presence of the stimulus;
then during extinction, the response gradually gives way to a state
of "not responding" again. But this type of formulation does not
define a closed or conservative system in any sense. In order to
derive properties of conditioning and extinction from the same set of
general laws, it is necessary to assign specific properties to the
state of not responding which is the alternative to occurrence of the
designated response. One solution is to assign properties as needed
by special hypotheses, as has been done, for example, in the
Pavlovian conception of inhibition. In the interest of simplicity of
theoretical structure, we shall avoid this procedure so far as
possible.

The role of competing reactions has been emphasized by some writers,
but usually neglected in formal theorizing. The point of view to be
developed here will adopt as a standard conceptual model a closed
system of behavioral and environmental variables. In any specific
behavior-system, the environmental component may include either the
entire population of stimuli available in the situation or some
specified portion of that population. The behavioral component will
consist in mutually exclusive classes of responses, defined in terms
of objective criteria; these classes will be exhaustive in the sense
that they will include all behaviors which may be evoked by that
stimulus situation. Given the initial probabilities of the various
responses available to an organism in a given situation, we shall
expect the laws of the theory to enable predictions of changes in
those probabilities as a function of changes in values of independent
variables.

Definitions  and  Assumptions 

1. R-variables. It will be assumed that any movement or sequence of
movements may be analyzed out of an organism's repertory of behavior
and treated as a "response," various properties of which can be
treated as dependent variables subject to all the laws of the theory.
(Hereafter we shall abbreviate the word response as R, with
appropriate subscripts where necessary.) In order to avoid a common
source of confusion, it will be necessary to make a clear distinction
between the terms R-class and R-occurrence.

The term R-class will always refer to a class of behaviors which
produce environmental effects within a specified range of values.
This definition is not without objection (cf. 4) but has the
advantage of following the actual practice of most experimenters. It
may be possible eventually to coordinate R-classes defined in terms
of environmental effects with R-classes defined in terms of effector
activities.

By R-occurrence we shall mean a particular, unrepeatable behavioral
event. All occurrences which meet the defining criteria of an R-class
are counted as instances of that class, and as such are
experimentally interchangeable. In fact, various instances of an
R-class are ordinarily indistinguishable in the record of an
experiment even though they may actually vary with respect to
properties which are not picked up by the recording mechanism.

Indices of tendency to respond, e.g., probability as defined below,
always refer to R-classes.

These distinctions may be clarified by an illustration. In the
Skinner-type conditioning apparatus, bar-pressing is usually treated
as an R-class. Any movement of the organism which results in
sufficient depression of the bar to actuate the recording mechanism
is counted as an instance of the class. The R-class may be subdivided
into finer classes by the same kind of criteria. We could, if
desired, treat depression of a bar by the rat's right forepaw and
depression of the bar by the left forepaw as instances of two
different classes provided that we have a recording mechanism which
will be affected differently by the two kinds of movements and
mediate different relations to stimulus input (as for example the
presentation of discriminative stimuli or reinforcing stimuli). If
probability is increased by reinforcement, then reinforcement of a
right-forepaw-bar-depression will increase the probability that
instances of that subclass will occur, and will also increase the
probability that instances of the broader class, bar-pressing, will
occur.

2. S-variables. For analytic purposes it is assumed that all
behavior is conditional upon appropriate stimulation. It is not
implied, however, that responses can be predicted only when eliciting
stimuli can be identified. According to the present point of view,
laws of learning enable predictions of changes in probability of
response as a function of time under given environmental conditions.

A stimulus, or stimulating situation, will be regarded as a finite
population of relatively small, independent, environmental events, of
which only a sample is effective at any given time. In the following
sections we shall designate the total number of elements associated
with a given source of stimulation as S (with appropriate subscripts
where more than one source of stimulation must be considered in an
experiment), and the number of elements effective at any given time
as s. It is assumed that when experimental conditions involve the
repeated stimulation of an organism by the "same stimulus," that is
by successive samples of elements from an S-population, each sample
may be treated as an independent random sample from S. It is to be
expected that sample size will fluctuate somewhat from one moment to
the next, in which case s will be treated as the average number of
elements per sample over a given period.

In  applying  the  theory,  any  portion 
of the environment to which the
organism is exposed under uniform
conditions may be considered an
S-population. The number of different
S's said
to  be  present  in  a  situation  will  depend 
upon  the  number  of  independent  ex- 
perimental operations,  and  the  degree  of 
specificity  with  which  predictions  of 
behavior  are  to  be  made.  If  the  experi- 
menter attempts  to  hold  the  stimulating 
situation  constant  during  the  course  of 
an experiment, then the entire
situation will be treated as a single S.
If in a conditioning experiment, a light
and shock are to be independently
manipulated as the CS and US, then each
of these sources of stimulation will be
treated as a separate S-population, and
so on.

It  should  be  emphasized  that  the 
division  of  environment  and  behavior 
into  elements  is  merely  an  analytic 
device  adopted  to  enable  the  applica- 
tion of  the  finite-frequency  theory  of 
probability  to  behavioral  phenomena. 
In  applying  the  theory  to  learning  ex- 
periments we  shall  expect  to  evaluate 
the  ratio  s/S  for  any  specific  situa- 
tion from  experimental  evidence,  but 
for  the  present  at  least  no  operational 
meaning  can  be  given  to  a  numerical 
value  for  either  S  or  s  taken  separately. 

3.  Probability  of  response.  Proba- 
bility will  be  operationally  defined  as 
the  average  frequency  of  occurrence  of 
instances of an R-class relative to the
maximum  possible  frequency,  under  a 
specified  set  of  experimental  conditions, 
over  a  period  of  time  during  which  the 
conditions  remain  constant.  In  accord- 
ance with  customary  usage  the  term 
probability,  although  defined  as  a  rela- 
tive frequency,  will  also  be  used  to  ex- 
press the  likelihood  that  a  response  will 
occur  at  a  given  time. 

4. Conditional relation. This relation
may obtain between an R-class and any
number of the elements in an
S-population, and has the following
implications.

(a) If a set of x elements from an S
are conditioned to (i.e., have the
conditional relation to) some R-class,
$R_1$, at a given time, the probability
that the next response to occur will be
an instance of $R_1$ is x/S.

(b) If at a given time in an
S-population, $x_1$ elements are
conditioned to some R-class, $R_1$, and
$x_2$ elements are conditioned to
another class, $R_2$, then $x_1$ and
$x_2$ have no common elements.

(c)  If  all  behaviors  which  may  be 
evoked  from  an  organism  in  a  given 
situation  have  been  categorized  into 
mutually  exclusive  classes,  then  the 
probabilities  attaching  to  the  various 
classes  must  sum  to  unity  at  all  times. 


We  consider  the  organism  to  be  always 
"doing  something."  If  any  arbitrarily 
defined  class  of  activities  may  be  se- 
lected as  the  dependent  variable  of  a 
given  experiment,  it  follows  that  the 
activity  of  the  organism  at  any  time 
must  be  considered  as  subject  to  the 
same  laws  as  the  class  under  considera- 
tion. Any  increase  in  probability  of 
one R-class during learning will, then,
necessarily  involve  the  reduction  in 
probability  of  other  classes;  similarly, 
while  the  probability  of  one  R  de- 
creases during  extinction,  the  probabili- 
ties of  others  must  increase.  In  other 
words,  learning  and  unlearning  will  be 
considered  as  transfers  of  probability 
relations between R-classes.

5.  Conditioning.  It  is  assumed  that 
on each occurrence of a response, $R_1$,
all new elements (i.e., elements not
already conditioned to $R_1$) in the
momentarily effective sample of stimulus
elements, s, become conditioned to $R_1$.

An  important  implication  of  these 
definitions  is  that  the  conditioning  of  a 
stimulus  element  to  one  R  automatically 
involves  the  breaking  of  any  pre-existing 
conditional relations with other R's.

6.  Motivation.  Experimental  opera- 
tions which  in  the  usual  terminology 
are said to produce motives (e.g.,
food-deprivation) may affect either the
composition of an S or the magnitude of
the  s/S  ratio.  Detailed  discussion  of 
these  relations  is  beyond  the  scope  of 
the  present  paper.  In  all  derivations 
presented  here  we  shall  assume  motivat- 
ing conditions  constant  throughout  an 
experiment. 

7.  Reinforcement.  This  term  will  be 
applied  to  any  experimental  condition 
which  ensures  that  successive  occur- 
rences of  a  given  R  will  each  be  con- 
tiguous with  a  new  random  sample  of 
elements from some specified
S-population. Various ways of realizing this
definition  experimentally  will  be  dis- 
cussed in  the  following  sections. 
Simple Conditioning: Reinforcement by
Controlled Elicitation

Let us consider first the simplest type
of conditioning experiment. The system
to be described consists of a
sub-population of stimulus elements,
$S_c$, which may be manipulated
independently of the remainder of the
situation, S, and a class, R, of
behaviors defined by certain measurable
properties. By means of a controlled
original stimulus, that is, one which
has initially a high probability of
evoking R, it is ensured that an
instance of R will occur on every trial
contiguously with the sample of stimulus
elements which is present. In the
familiar buzz-shock conditioning
experiment, for example, $S_c$ would
represent the population of stimulus
elements emanating from the sound source
and R would include all movements of a
limb meeting certain specifications of
direction and amplitude; typically, the
R to be conditioned is a flexion
response which may be evoked on each
training trial by administration of an
electric shock.

Designating the mean number of elements
from $S_c$ effective on any one trial as
$s_c$, and the number of elements from
$S_c$ which are conditioned to R at any
time as x, the expected number of new
elements conditioned on any trial will
be

$$\Delta x = s_c \, \frac{S_c - x}{S_c}. \tag{1}$$

If the change in x per trial is
relatively small, and the process is
assumed continuous, the right hand
portion of (1) may be taken as the
average rate of change of x with respect
to number of trials, T, at any moment,
giving

$$\frac{dx}{dT} = s_c \, \frac{S_c - x}{S_c}. \tag{2}$$

This differential equation may be
integrated to yield

$$x = S_c - (S_c - x_0)\,e^{-qT}, \tag{3}$$

where $x_0$ is the initial value of x,
and q represents the ratio $s_c/S_c$.
Thus x will increase from its initial
value to approach the limiting value,
$S_c$, in a negatively accelerated
curve. A method of evaluating x in these
equations from empirical measures of
response latency, or reaction time, will
be developed in a later section.
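
As a rough numerical check on equations (1) to (3), the following sketch iterates the per-trial increment of equation (1) and compares the result with the closed form of equation (3). The function names and the parameter values ($S_c = 100$, $s_c = 5$, $x_0 = 10$) are illustrative assumptions, not values taken from the text.

```python
import math


def discrete_x(S_c, s_c, x0, trials):
    """Iterate equation (1): each trial adds s_c * (S_c - x) / S_c elements."""
    x = x0
    history = [x]
    for _ in range(trials):
        x += s_c * (S_c - x) / S_c
        history.append(x)
    return history


def continuous_x(S_c, s_c, x0, T):
    """Equation (3): x = S_c - (S_c - x0) * exp(-q * T), with q = s_c / S_c."""
    q = s_c / S_c
    return S_c - (S_c - x0) * math.exp(-q * T)


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    S_c, s_c, x0 = 100.0, 5.0, 10.0
    xs = discrete_x(S_c, s_c, x0, 40)
    for T in (0, 10, 20, 40):
        print(T, round(xs[T], 2), round(continuous_x(S_c, s_c, x0, T), 2))
```

When the ratio $s_c/S_c$ is small the two columns stay close together, which is the condition under which the text treats the process as continuous.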

If the remainder of the situation has
been experimentally neutralized, the
probability of R in the presence of a
sample from $S_c$ will be given by the
ratio $x/S_c$. Representing this ratio
by the single letter p, and making
appropriate substitutions in (3), we
have the following expression for
probability of R as a function of the
number of reinforced trials.

$$p = 1 - (1 - p_0)\,e^{-qT}. \tag{3'}$$

Since  we  have  not  assumed  any  spe- 
cial properties  for  the  original  (or  un- 
conditioned) stimulus  other  than  that 
of  regularly  evoking  the  response  to  be 
conditioned,  it  is  to  be  expected  that 
the  equations  developed  in  this  section 
will  describe  the  accumulation  of  con- 
ditional relations  in  other  situations 
than  classical  conditioning,  provided 
that  other  experimental  operations  func- 
tion to  ensure  that  the  response  to  be 
learned  will  occur  in  the  presence  of 
every sample drawn from the
S-population.

Operant Conditioning: Reinforcement by
Contingent Stimulation

In the more common type of experimental
arrangement, variously termed operant,
instrumental, trial and error, etc. by
different investigators, the response to
be learned is not elicited by a
controlled original stimulus, but has
some initial strength in the
experimental situation and occurs
originally as part of so-called "random
activity." Here the response cannot be
evoked concurrently with the
presentation of each new stimulus
sample, but some of the same effects can
be secured by making changes in the
stimulating situation contingent upon
occurrences of the response. Let us
consider a situation of this sort,
assuming that the activities of the
organism have been catalogued and
classified into two categories, all
movement sequences characterized by a
certain set of properties being assigned
to class R and all others to the class
$R_e$, and that members of class R are
to be learned.

If changes in the stimulus sample are
independent of the organism's behavior,
we should expect instances of the two
response classes to occur, on the
average, at rates proportional to their
initial probabilities. For if x elements
from the S-population are originally
conditioned to R, then the probability
of R will be x/S; the number of new
elements conditioned to R if an instance
occurs will be $s[(S - x)/S]$, s again
representing the number of stimulus
elements in a sample; and the
mathematically expected increase in x
will be the product of these quantities,
$sx[(S - x)/S^2]$. At the same time, the
probability of $R_e$ will be
$(S - x)/S$, and the number of new
elements conditioned to $R_e$ if an
instance occurs will be sx/S;
multiplying these quantities, we have
$sx[(S - x)/S^2]$ as the mathematically
expected decrease in x. Thus we should
predict no average change in x under
these conditions.
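
The cancellation argued in this paragraph can be checked with exact arithmetic. The sketch below is only an illustration: the function name and the sample values of x, S, and s are assumptions, and rational numbers are used so that the expected gain and loss cancel exactly rather than to within rounding error.

```python
from fractions import Fraction


def expected_change(x, S, s):
    """Expected net change in x when sampling is independent of behavior.

    gain: P(R) times the new elements conditioned to R if R occurs
    loss: P(R_e) times the elements re-conditioned to R_e if R_e occurs
    """
    x, S, s = Fraction(x), Fraction(S), Fraction(s)
    gain = (x / S) * (s * (S - x) / S)
    loss = ((S - x) / S) * (s * x / S)
    return gain - loss


if __name__ == "__main__":
    # Hypothetical values; any 0 <= x <= S gives a net change of zero.
    for x in (0, 10, 30, 75, 100):
        print(x, expected_change(x, S=100, s=5))
```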

In  the  acquisition  phase  of  a  learn- 
ing experiment  two  important  restric- 
tions imposed  by  the  experimenter  tend 
to  force  a  correlation  between  changes 
in  the  stimulus  sample  and  occurrences 
of  R.  The  organism  is  usually  intro- 
duced into  the  experimental  situation 
at  the  beginning  of  a  trial,  and  the 


trial  lasts  until  the  pre-designated  re- 
sponse, R,  occurs.  For  example,  in  a 
common  discrimination  apparatus  the 
animal  is  placed  on  a  jumping  stand  at 
the  beginning  of  each  trial  and  the  trial 
continues  until  the  animal  leaves  the 
stand;  a  trial  in  a  runway  experiment 
lasts  until  the  animal  reaches  the  end 
box,  and  so  on.  Typically  the  stimulat- 
ing situation  present  at  the  beginning 
of  a  trial  is  radically  changed,  if  not 
completely terminated, by the
occurrence of the response in question; and a
new  trial  begins  under  the  same  condi- 
tions, except  for  sampling  variations, 
after  some  pre-designated  interval.  The 
pattern  of  movement-produced  stimuli 
present  during  a  trial  may  be  changed 
after occurrences of R by the evocation
of  some  uniform  bit  of  behavior  such  as 
eating  or  drinking;  in  some  cases  the 
behavior  utilized  for  this  purpose  must 
be  established  by  special  training  prior 
to  a  learning  experiment.  In  the 
Skinner  box,  for  example,  the  animal  is 
trained  to  respond  to  the  sound  of  the 
magazine  by  approaching  it  and  eating 
or  drinking.  Then  when  operation  of 
the  magazine  follows  the  occurrence  of 
a  bar-pressing  response  during  condi- 
tioning of  the  latter,  the  animal's  re- 
sponse to  the  magazine  will  remove  it 
from  the  stimuli  in  the  vicinity  of  the 
bar  and  ensure  that  for  an  interval  of 
time  thereafter  the  animal  will  not  be 
exposed to most of the S-population;
therefore  the  sample  of  elements  to 
which  the  animal  will  next  respond  may 
be  considered  very  nearly  a  new  random 
sample  from  S. 

In the simplest operant conditioning
experiments it may be possible to
change almost the entire stimulus sample
after each occurrence of R (complete
reinforcement), while in other cases the
sampling of only some restricted portion
of the S-population is correlated with R
(partial reinforcement). We shall
consider the former case in some detail
in the remainder of this section.

By  our  definition  of  the  conditional 
relation, we shall expect all R-classes
from  which  instances  actually  occur  on 
any   trial   to   be   conditioned   to   stim- 
ulus   elements    present    on    that    trial. 
The   first  movement   to   occur  will  be 
conditioned  to  the  environmental  cues 
present  at  the  beginning  of  the  trial; 
the    next    movement    will    be    condi- 
tioned  to   some   external   cues,   if   the 
situation    is    not    completely    constant 
during   a   trial,   and   to   proprioceptive 
cues  from  the  first  movement,  and  so  on, 
until    the    predesignated    response,    R, 
occurs    and    terminates    the    trial.      If 
complete  constancy  of  the  stimulating 
situation  could  be  maintained,  the  most 
probable  course  of  events  on  the  next 
trial  would   be   the   recurrence   of   the 
same  sequence  of  movements.    In  prac- 
tice, however,  the  sample  of  effective 
stimulus  elements  will  change  somewhat 
in    composition,    and    some    responses 
which  occur  on  one  trial  may  fail  to  oc- 
cur on  the  next.     The  only  response 
which  may  never  be  omitted  is  R,  since 
the  trial  continues  until  R  occurs.    This 
argument  has  been  developed  in  greater 
detail   by   Guthrie    (4).     In   order   to 
verify  the  line  of  reasoning  involved, 
we  need  now  to  set  these  ideas  down  in 
mathematical  form  and  investigate  the 
possibility  of  deriving  functions  which 
will  describe  empirical  curves  of  learn- 
ing. 

Since each trial lasts until R occurs,
we need an expression for the probable
duration of a trial in terms of the
strength of R. Suppose that we have
categorized all movement sequences
which are to be counted as "responses"
in a given situation, and that the
minimum time needed for completion of a
response-occurrence is, on the average,
h. For convenience in the following
development, we shall assume that the
mean duration of instances of class R is
approximately equal to that of class
$R_e$. Let the total number of stimulus
elements available in the experimental
situation be represented by S, the
sample effective on any one trial by s,
and the ratio s/S by q. The probability,
p, of class R at the beginning of any
trial will have the value x/S; if this
value varies little within a trial, we
can readily compute the probable number
of responses (of all classes) that will
occur before the trial is terminated.
The probability that an instance of R
will be the first response to occur on
the trial in question is p; the
probability that it will be the second
is $p(1 - p)$; the probability that it
will be the third is $p(1 - p)^2$; etc.
If we imagine an indefinitely large
number of trials run under identical
conditions, and represent the number of
response occurrences on any trial by n,
we may weight each possible value of n
by its probability (i.e., expected
relative frequency) and obtain a mean
expected value of n. In symbolic
notation we have

$$\bar{n} = \sum n\,p(1 - p)^{n-1} = p \sum n(1 - p)^{n-1}.$$

The expression inside the summation
sign will be recognized as the general
term of a well-known infinite series
with the sum $1/[1 - (1 - p)]^2$. Then
we have, by substitution,

$$\bar{n} = p/[1 - (1 - p)]^2 = 1/p.$$

Then L, the average time per trial, will
be the product of the expected number
of responses and the mean time per
response.

$$L = \bar{n}h = h/p = Sh/x.$$
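
The series argument above can be checked numerically. The following sketch sums the weighted terms directly and compares the result with 1/p; the function names, the truncation length, and the sample values of p and h are assumptions made for illustration only.

```python
def mean_responses_per_trial(p, terms=10_000):
    """Truncated sum of n * p * (1 - p)**(n - 1) for n = 1..terms.

    The truncation is adequate when p is not very small; the exact
    value of the infinite series is 1 / p.
    """
    return sum(n * p * (1 - p) ** (n - 1) for n in range(1, terms + 1))


def mean_trial_duration(p, h, terms=10_000):
    """Average time per trial, L = n_bar * h, which should be close to h / p."""
    return mean_responses_per_trial(p, terms) * h


if __name__ == "__main__":
    # Hypothetical probabilities and a hypothetical response time h.
    for p in (0.1, 0.25, 0.5, 0.9):
        print(p, mean_responses_per_trial(p), 1 / p, mean_trial_duration(p, h=2.5))
```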

Since R will be conditioned to all new
stimulus elements present on each trial,
we may substitute for x its equivalent
from equation (3), dropping the
subscripts from $S_c$ and $s_c$, and
obtaining

$$L = \frac{Sh}{S - (S - x_0)\,e^{-qT}} = \frac{h}{1 - \dfrac{(L_0 - h)}{L_0}\,e^{-qT}}. \tag{4}$$

Thus, L will decline from an initial
value of $L_0$ (equal to $Sh/x_0$) and
approach the asymptotic minimum value
h over a series of trials.

A preliminary test of the validity of
this development may be obtained by
applying equation (4) to learning data
from a runway experiment in which the
conditions assumed in the derivation are
realized to a fair degree of
approximation. In Fig. 1 we have plotted
acquisition data reported by Graham and
Gagne (3). Each empirical point
represents the geometric mean latency
for a group of 21 rats which were
reinforced with food for traversing a
simple elevated runway. The theoretical
curve in the figure represents the
equation

$$L = \frac{2.5}{1 - .9648\,e^{-.12T}},$$

where values of $L_0$, h, and q have been
estimated from the data. This curve
appears to give a satisfactory
graduation of the obtained points and,
it might be noted, is very similar in
form to the theoretical acquisition
curve developed by Graham and Gagne. The
present formulation differs from theirs
chiefly in including the time of the
first response as an integral part of
the learning process. The quantitative
description of extinction in this
situation will be presented in a
forthcoming paper.

Fig. 1. Latencies of a runway response
during conditioning, obtained from
published data of Graham and Gagne (3),
are fitted by a theoretical curve
derived in the text.
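
To make the shape of the latency curve of equation (4) concrete, the sketch below evaluates it over a series of trials. The function name is an assumption, and the parameter values ($L_0 = 71$, $h = 2.5$, $q = 0.12$) are placeholders chosen to be of the same order as the fitted runway curve; they should not be read as the published estimates.

```python
import math


def latency(T, L0, h, q):
    """Equation (4): L = h / (1 - ((L0 - h) / L0) * exp(-q * T))."""
    return h / (1.0 - ((L0 - h) / L0) * math.exp(-q * T))


if __name__ == "__main__":
    # Placeholder parameters (assumed for illustration only).
    L0, h, q = 71.0, 2.5, 0.12
    for T in (0, 5, 10, 20, 40, 80):
        print(T, round(latency(T, L0, h, q), 2))
```

At T = 0 the expression reduces to $L_0$, and for large T it approaches h, which matches the two limits described in the text.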

In  order  to  apply  the  present  theory 
to  experimental  situations  such  as  the 
Skinner  box,  in  which  the  learning  pe- 
riod is  not  divided  into  discrete  trials, 
we  shall  have  to  assume  that  the  in- 
tervals between  reinforcements  in  those 
situations may be treated as "trials"
for  analytical  purposes.  Making  this 
assumption,  we  may  derive  an  expres- 
sion for  rate  of  change  of  conditioned 
response  strength  as  a  function  of  time 
in  the  experimental  situation,  during  a 
period  in  which  all  responses  of  class  R 
are  reinforced. 

L,  as  defined  above,  will  represent  the 
time  between  any  two  occurrences  of  R. 
Then  if  we  let  t  represent  time  elapsed 
from  the  beginning  of  the  learning  pe- 
riod to  a  given  occurrence  of  R,  and  T 
the  number  of  occurrences  (and  there- 
fore reinforcements)  of  R,  we  have  from 
the  preceding  development 

L  =  Sh/x. 

Since L may be considered as the
increment in time during a trial, we can
write the identity

$$\frac{\Delta x}{\Delta t} = \frac{\Delta x}{\Delta T} \cdot \frac{\Delta T}{\Delta t}.$$

Substituting for $\Delta x/\Delta T$ its
equivalent from (1), without subscripts,
and for $\Delta T/\Delta t$ its
equivalent from the preceding equation,
we have

$$\frac{\Delta x}{\Delta t} = \frac{s(S - x)}{S} \cdot \frac{x}{hS} = \frac{s(S - x)x}{hS^2}. \tag{5}$$

If the change in x per reinforcement is
small and the process is assumed
continuous, the right hand portion of
equation (5) may be taken as the value
of the derivative dx/dt and integrated
with respect to time:

$$x = \frac{S}{1 + \dfrac{(S - x_0)}{x_0}\,e^{-Bt}}, \tag{6}$$

where B = s/Sh. In general, this
equation defines a logistic curve with
the amount of initial acceleration
depending upon the value of $x_0$.
Curves of probability (x/S) vs. time for
S = 100, B = 0.25, and several different
values of $x_0$ are illustrated in
Fig. 2.
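
Curves of the kind described for Fig. 2 can be tabulated from equation (6). The sketch below uses S = 100 and B = 0.25 as stated in the text; the particular initial values of x (1, 10, and 50) and the function name are assumptions, since the values used in the figure are not given here.

```python
import math


def prob_logistic(t, S, B, x0):
    """Equation (6) divided by S: p(t) = 1 / (1 + ((S - x0) / x0) * exp(-B * t))."""
    return 1.0 / (1.0 + ((S - x0) / x0) * math.exp(-B * t))


if __name__ == "__main__":
    S, B = 100.0, 0.25
    # Initial x-values below are hypothetical, chosen to show the change in shape.
    for x0 in (1.0, 10.0, 50.0):
        row = [round(prob_logistic(t, S, B, x0), 3) for t in range(0, 41, 5)]
        print(x0, row)
```

With a small initial value the curve shows a marked positively accelerated limb before levelling off, while with an initial value of S/2 it is negatively accelerated throughout.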


Since we are considering a situation
in which a reinforcement is administered
(or a new "trial" is begun) after each
occurrence of R, we are now in a
position to express the expected rate of
occurrence of R as a function of time.
Representing rate of occurrence of R
by r = dR/dt, and the ratio 1/h by w,
we have

$$r = \frac{dR}{dt} = \frac{dT}{dt} = \frac{wx}{S} = \frac{w}{1 + \dfrac{(S - x_0)}{x_0}\,e^{-Bt}},$$

and if we take the rate of R at the
beginning of the experimental period as
$r_0 = wx_0/S$ this relation becomes

$$r = \frac{w}{1 + \dfrac{(w - r_0)}{r_0}\,e^{-Bt}}. \tag{7}$$

To illustrate this function, we have
plotted in Fig. 3 measures of rate of
responding during conditioning of a
bar-pressing response by a single rat.
The apparatus was a Skinner box;
motivation was 24 hours thirst; the
animal had previously been trained to
drink out of the magazine, and during
the period illustrated was reinforced
with water for all bar-pressing
responses. Measures of rate at various
times were obtained by counting the
number of responses made during the
half-minute before and the half-minute
after the point being considered, and
taking that value as an estimate of the
rate in terms of responses per minute at
the midpoint. The theoretical curve in
the figure represents the equation

$$r = \frac{13}{1 + 25\,e^{-.24t}}.$$

Fig. 2. Illustrative curves of
probability vs. time during
conditioning; parameters of the curves
are the same except for the initial
x-values.

Fig. 3. Number of responses per minute
during conditioning of a bar-pressing
habit in a single rat; the theoretical
curve is derived in the text.
A considerable part of the variability
of the empirical points in the figure is
due to the inaccuracy of the method
of estimating rates. In order to avoid
this loss of precision, the writer has
adopted the practice of using
cumulative curves of responses vs. time
for most purposes, and fitting the
cumulative records with the integral of
equation (7):

$$R = wt + \frac{w}{B}\,\log\left(\frac{r_0}{w} + \frac{w - r_0}{w}\,e^{-Bt}\right), \tag{8}$$

where R represents the number of
responses made after any interval of
time, t, from the beginning of the
learning period. The original record of
responses vs. time, from which the data
of Fig. 3 were obtained, is reproduced
in Fig. 4. Integration of the rate
equation for this animal yields

$$R = 13t + 125 \log_{10}\left(.038 + .962\,e^{-.24t}\right).$$

Magnitudes of R computed from this
equation for several values of t have
been plotted in Fig. 4 to indicate the
goodness of fit; the theoretical curve
has not been drawn in the figure since
it would completely obscure most of the
empirical record. In an experimental
report now in press (2), equation (8)
is fitted to several mean conditioning
curves for groups of four rats; in all
cases, the theoretical curve accounts
for more than 99 per cent of the
variance of the observed R values.
Further verification of the present
formulation has been derived from that
study by comparing the acquisition
curves of successively learned
bar-pressing habits, obtained in a
Skinner-type conditioning apparatus
which included two bars differing only
in position. It has been found that the
parameters w and s/S can be evaluated
from the conditioning curve of one bar
response, and then used to predict the
detailed course of conditioning of a
second learned response.

Fig. 4. Reproduction of the original
cumulative record from which the points
of Fig. 3 were obtained. Solid circles
are computed from an equation given in
the text.
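
The relation between the rate function of equation (7) and its integral can be verified numerically. In the sketch below the values w = 13 and B = .24 are read from the fitted curve for the single rat, and $r_0 = 0.5$ is the initial rate implied by the coefficient 25; the function names and the trapezoidal check are assumptions added for illustration.

```python
import math


def rate(t, w, r0, B):
    """Equation (7): r = w / (1 + ((w - r0) / r0) * exp(-B * t))."""
    return w / (1.0 + ((w - r0) / r0) * math.exp(-B * t))


def cumulative(t, w, r0, B):
    """Integral of equation (7) from 0 to t, using the natural logarithm."""
    return w * t + (w / B) * math.log(r0 / w + ((w - r0) / w) * math.exp(-B * t))


def trapezoid(f, a, b, n=20_000):
    """Plain trapezoidal rule, used only to cross-check the closed form."""
    step = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * step) for i in range(1, n))
    return total * step


if __name__ == "__main__":
    w, r0, B = 13.0, 0.5, 0.24
    for t in (0, 5, 10, 20, 30):
        numeric = trapezoid(lambda u: rate(u, w, r0, B), 0.0, float(t)) if t else 0.0
        print(t, round(rate(t, w, r0, B), 3), round(cumulative(t, w, r0, B), 3), round(numeric, 3))
    # Converting the natural-log coefficient w / B to base 10 gives roughly 125.
    print(round((w / B) * math.log(10), 1))
```

The last line shows why the coefficient of the base-10 logarithm in the fitted cumulative equation is about 125 when w = 13 and B = .24.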

The  overall  accuracy  of  these  equa- 
tions in  describing  the  rate  of  condi- 
tioning of  bar-pressing  and  runway  re- 
sponses should  not  be  allowed  to  obscure 
the  fact  that  a  small  but  systematic 
error  is  present  in  the  initial  portion 
of  most  of  the  curves.  It  is  believed 
that  these  disparities  are  due  to  the  fact 
that  experimental  conditions  do  not 
usually  fully  realize  the  assumption  that 
only one R-class receives any
reinforcement during the learning period. A
more  general  formulation  of  the  theory, 
which  does  not  require  this  assumption, 
will  be  discussed  in  the  next  section. 

Partial  Reinforcement 

It  can  be  shown  that  a  given  response 
may  be  "learned"  in  a  trial  and  error 
situation provided that some
sub-population of stimulus elements is
so controlled by experimental
conditions that
each  sample  of  elements  drawn  from 
it  is  contiguous  with  an  occurrence  of 
the  response.  The  sort  of  derivation 
needed  to  handle  this  kind  of  partial 
reinforcement  will  be  sketched  briefly 
in  this  section.  A  more  detailed  treat- 
ment will  be  given,  together  with  rele- 
vant experimental  evidence,  in  a  paper 
now  in  preparation.  It  should  be  em- 
phasized that  we  are  using  the  term 
"partial"  to  refer  to  incomplete  change 
of the stimulus sample on each
occurrence of a given response, and not to
periodic,  or  intermittent  reinforcement. 

Consider a behavior system involving
two classes of competing behaviors,
R and $R_e$, which may occur in a
situation, S, composed of two
independently manipulable
sub-populations, $S_r$ and $S_e$.
Experimental conditions are to ensure
that of the sample, s, of elements
stimulating the organism at any time,
elements from $S_r$ remain effective
until terminated by the occurrence of R,
while elements from $S_e$ remain
effective until terminated by the
occurrence of $R_e$. This kind of system
might be illustrated by a Skinner box in
which the entire stimulus sample is not
terminated by occurrence of the
bar-pressing response; for example, if
the box is illuminated, the visual
stimulation will be relatively
unaffected by bar-pressing but will be
terminated if the animal closes its eyes
(the latter behavior being, then, an
instance of $R_e$).

Let x represent the total number of
elements from S conditioned to R at a
given time, $x_r$ the number of elements
from $S_r$ conditioned to R, $T_r$ and
$T_e$ the numbers of occurrences of R
and $R_e$ prior to the time in question,
and q the ratio s/S. By reasoning
similar to that utilized in deriving
equations (2) and (5), we may obtain for
the average rate of change of $x_r$ with
respect to $T_r$ at any time

$$\frac{dx_r}{dT_r} = s\,\frac{S_r}{S}\,\frac{(S_r - x_r)}{S_r} = q(S_r - x_r). \tag{9}$$

This may be integrated to yield

$$x_r = S_r - (S_r - x_{r0})\,e^{-qT_r} \tag{10}$$

which is identical in form with
equation (2).

The other component of x, $(x - x_r)$,
will decrease as these elements become
conditioned to the competing response
class, $R_e$, according to the following
relations.

$$\frac{d(x - x_r)}{dT_e} = -s\,\frac{(S - S_r)}{S}\,\frac{(x - x_r)}{(S - S_r)} = -q(x - x_r) \tag{11}$$

and the integral,

$$x - x_r = (x_0 - x_{r0})\,e^{-qT_e}. \tag{12}$$

It  will  be  observed  that  an  analogous 
set  of  equations  could  be  written  for 
changes in the number of elements
conditioned to $R_e$, and that the argument
could  be  extended  to  any  number  of 
mutually  exclusive  classes  of  responses. 
From  these  relations  it  is  not  difficult 
to  deduce  differential  equations  which 
may  be  at  least  numerically  integrated 
to  yield  curves  giving  probability  of 
occurrence  of  each  response  class  as  a 
function  of  number  of  reinforcements. 
We  shall  not  carry  out  the  derivations 
here,  but  shall  point  out  a  number  of 
properties  of  the  curves  obtained  which 
will  be  evident  from  inspection  of  equa- 
tions (10)  and  (12). 

1. Regardless of the initial
probabilities, the behavior system will
tend to a state of equilibrium in which
the final mean probability of R will be
$S_r/S$ and the final mean probability
of $R_e$ will be $(S - S_r)/S$.

2. If the number of elements from S
conditioned to R at the start of an
experiment is greater than $S_r$, the
probability of R will decrease until the
equilibrium is reached. (Of course all
statements made here about R have
analogues for $R_e$.)

3. If the number of elements from S
conditioned to R at the start of an
experiment is less than $S_r$, the
probability of R will increase until the
equilibrium value is reached.

4. If all elements originally
conditioned to R belong to the
sub-population $S_r$, then the curve
relating probability to number of
reinforcements will be identical with
equation (3') except for the asymptote,
which will be $S_r/S$ rather than unity.

5. If some of the elements originally
conditioned to R do not belong to $S_r$,
but $x_0$ is less than $S_r$, then the
curve relating probability to number of
reinforcements will rise less steeply at
first than equation (3'), and may even
have an initial positively accelerated
limb.
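
The equilibrium properties listed above can be illustrated by a simple expected-value iteration built on equations (9) and (11). This is only a sketch under stated assumptions: each step is treated as one response occurrence, the changes are weighted by the current probabilities of R and $R_e$, and the function name and parameter values (S = 100, $S_r$ = 60, s = 5) are hypothetical. It is not the derivation the text leaves for a later paper.

```python
def simulate_partial(S, S_r, s, x_r0, x_other0, steps):
    """Expected-value iteration for partial reinforcement.

    x_r     : elements of S_r conditioned to R
    x_other : elements outside S_r conditioned to R (the component x - x_r)
    Returns the probability of R, (x_r + x_other) / S, before each step.
    """
    q = s / S
    x_r, x_other = float(x_r0), float(x_other0)
    probs = []
    for _ in range(steps):
        p = (x_r + x_other) / S
        probs.append(p)
        # With probability p an R occurs and a new sample from S_r is
        # conditioned to it, as in equation (9).
        x_r += p * q * (S_r - x_r)
        # With probability 1 - p an R_e occurs and elements outside S_r
        # are re-conditioned to R_e, as in equation (11).
        x_other -= (1.0 - p) * q * x_other
    return probs


if __name__ == "__main__":
    # Hypothetical values: start below and above the equilibrium S_r / S = 0.6.
    low = simulate_partial(100.0, 60.0, 5.0, x_r0=5.0, x_other0=5.0, steps=2000)
    high = simulate_partial(100.0, 60.0, 5.0, x_r0=60.0, x_other0=30.0, steps=2000)
    print(round(low[0], 3), round(low[-1], 3))
    print(round(high[0], 3), round(high[-1], 3))
```

Both runs settle near $S_r/S$, one from below and one from above. Note that the iteration cannot leave a starting probability of exactly zero, since no instance of R would then occur to be reinforced.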

It  will  be  noted  that  from  the  present 
point  of  view,  conditioning  and  extinc- 
tion are  regarded  simply  as  two  aspects 
of  a  single  process.  In  practice  we 
categorize  a  given  experiment  as  a 
study  of  conditioning  or  a  study  of 
extinction  depending  upon  which  be- 
haviors are  being  recorded.  It  seems 
quite  possible  that  both  conditioning 
and  extinction  always  occur  concur- 
rently in  any  behavior  system,  and  that 
the  common  practice  of  regarding  them 
as  separate  processes  is  based  more  on 
tradition  and  the  limitations  of  record- 
ing apparatus  than  upon  rational  con- 
siderations. In  the  present  formula- 
tion, reinforcement  is  treated  as  a  quan- 
titatively graded  variable  with  "pure 
extinction"  at  one  end  of  a  continuum. 
Any portion of an S-population may be
related to an R-class by experimental
conditions which produce a correlation
between stimulus sampling and R-occurrences.
Under given conditions of reinforcement an R-class may
increase or decrease in probability of occurrence
over a series of trials depending
upon whether the momentary probability
is less than or greater than the
equilibrium value for those conditions.

Discussion 

The  foregoing  sections  will  suffice  to 
illustrate  the  manner  in  which  problems 
of  learning  may  be  handled  within  the 
framework  of  a  statistical  theory.  The 
extent  to  which  the  formal  system  de- 
veloped here  may  be  fruitfully  applied 
to  interpret  experimental  phenomena 
can  only  be  answered  by  a  considerable 
program of research. A study of concurrent
conditioning and extinction of
simple  skeletal  responses  which  realizes 
quite  closely  the  simplified  conditions 
assumed  in  the  derivations  of  the  pres- 
ent paper  has  been  completed,  and  a 
report  is  now  in  press.  Other  papers 
in  preparation  will  apply  this  formula- 
tion to  extinction,  spontaneous  recovery, 
discrimination,  and  related  phenomena. 

The  relation  of  this  program  to  con- 
temporary theories  of  learning  requires 
little  comment.  No  attempt  has  been 
made  to  present  a  "new"  theory.  It 
is  the  purpose  of  our  investigation  to 
clarify  some  of  the  conceptions  of  learn- 
ing and  discrimination  by  stating  im- 
portant concepts  in  quantitative  form 
and  investigating  their  interrelation- 
ships by  mathematical  analysis.  Many 
similarities  will  be  noted  between  func- 
tions developed  here  and  "homologous" 
expressions  in  the  quantitative  formu- 
lations of  Graham  and  Gagne  (3)  and 
of  Hull  (6).  A  thorough  study  of  those 
theories  has  influenced  the  writer's 
thinking  in  many  respects.  Rather 
than  build  directly  on  either  of  those 
formulations,  I  have  felt  it  desirable  to 
explore  an  alternative  point  of  view 
based  on  a  statistical  definition  of  en- 
vironment and  behavior  and  doing 
greater  justice  to  the  theoretical  views 
of Skinner and Guthrie. A statistical
theory seems to be an inevitable development
at the present stage of the
science  of  behavior;  agreement  on  this 
point  may  be  found  among  writers  of 
otherwise  widely  diverse  viewpoints, 
e.g.,  Brunswik  (1),  Hoagland  (5), 
Skinner  (7),  and  Wiener  (9).  It  is  to 
be  expected  that  with  increasing  rigor 
of  definition  and  continued  interplay 
between  theory  and  experiment,  the 
various  formulations  of  learning  will 
tend  to  converge  upon  a  common  set  of 
concepts. 

It  may  be  helpful  to  outline  briefly 
the  point  of  view  on  certain  contro- 
versial issues  implied  by  the  present 
analysis. 

Stimulus-response  terminology.  An 
attempt  has  been  made  to  overcome 
some  of  the  rigidity  and  oversimplifica- 
tion of  traditional  stimulus-response 
theory  without  abandoning  its  principal 
advantages.  We  have  adopted  a  defini- 
tion of  stimulus  and  response  similar  to 
Skinner's  (7)  concept  of  generic  classes, 
and  have  given  it  a  statistical  interpre- 
tation. Laws  of  learning  developed 
within  this  framework  refer  to  behavior 
systems  (as  defined  in  the  introductory 
section  of  this  paper)  rather  than  to 
relations  between  isolated  stimulus- 
response  correlations. 

The  learning  curve.  This  investi- 
gation is  not  intended  to  be  another 
search  for  "the  learning  function." 
The  writer  does  not  believe  that  any 
simple  function  will  be  found  to  ac- 
count for  learning  independently  of 
particular  experimental  conditions.  On 
the  other  hand,  it  does  seem  quite  pos- 
sible that  from  a  relatively  small  set  of 
definitions  and  assumptions  we  may  be 
able  to  derive  expressions  describing 
learning  under  various  specific  experi- 
mental arrangements. 

Measures  of  behavior.  Likelihood  of 
responding has been taken as the primary
dependent variable. Analyses presented
above indicate that simple relations
can be derived between probability
and such common experimentally
obtained  measures  as  rate  of  responding 
and  latency. 

Laws  of  contiguity  and  effect.  Avail- 
able experimental  evidence  on  simple 
learning  has  seemed  to  the  writer  to 
require  the  assumption  that  temporal 
contiguity  of  stimuli  and  behavior  is  a 
necessary  condition  for  the  formation 
of  conditional  relations.  At  the  level 
of  differential  analysis,  that  is  of  laws 
relating  momentary  changes  in  behav- 
ior to  changes  in  independent  variables, 
no  other  assumption  has  proved  neces- 
sary at  the  present  stage  of  the  investi- 
gation. In  order  to  account  for  the 
accumulation  of  conditional  relations  in 
favor of one R-class at the expense of
others  in  any  situation,  we  have  ap- 
pealed to  a  group  of  experimental  op- 
erations which  are  usually  subsumed 
under  the  term  "reinforcement"  in  cur- 
rent experimental  literature.  Both 
Guthrie's  (4)  verbal  analyses  and  the 
writer's  mathematical  investigations  in- 
dicate that  an  essential  property  of  re- 
inforcement is  that  it  ensures  that  suc- 
cessive occurrences  of  a  given  R  will  be 
contiguous  with  different  samples  from 
the  available  population  of  stimuli.  We 
have  made  no  assumptions  concerning 
the  role  of  special  properties  of  certain 
after-effects  of  responses,  such  as  drive- 
reduction,  changes  in  affective  tone,  etc. 
Thus  the  quantitative  relations  devel- 
oped here  may  prove  useful  to  investi- 
gators of  learning  phenomena  regardless 
of  the  investigators'  beliefs  as  to  the 
nature  of  underlying  processes. 

Summary 

An  attempt  has  been  made  to  clarify 
some  issues  in  current  learning  theory 
by  giving  a  statistical  interpretation  to 
the  concepts  of  stimulus  and  response 
and  by  deriving  quantitative  laws  that 
govern  simple  behavior  systems.  De- 
pendent variables,  in  this  formulation, 
are classes of behavior samples with
common quantitative properties; independent
variables are statistical distributions
of environmental events.
Laws  of  the  theory  state  probability 
relations  between  momentary  changes  in 
behavioral  and  environmental  variables. 

From  this  point  of  view  it  has  been 
possible  to  derive  simple  relations  be- 
tween probability  of  response  and  sev- 
eral commonly  used  measures  of  learn- 
ing, and  to  develop  mathematical  ex- 
pressions describing  learning  in  both 
classical  conditioning  and  instrumental 
learning  situations  under  simplified  con- 
ditions. 

No  effort  has  been  made  to  defend 
the  assumptions  underlying  this  formu- 
lation by  verbal  analyses  of  what 
"really"  happens  inside  the  organism 
or  similar  arguments.  It  is  proposed 
that  the  theory  be  evaluated  solely  by 
its  fruitfulness  in  generating  quantita- 
tive functions  relating  various  phenom- 
ena of  learning  and  discrimination. 


STATISTICAL THEORY OF SPONTANEOUS RECOVERY
AND REGRESSION

W.  K.  ESTES 1 
Indiana  University 


From  the  viewpoint  of  one  interested 
in  constructing  a  learning  theory,  it 
would  be  convenient  if  an  organism's 
habits  of  responding  with  respect  to 
any  given  situation  were  modifiable 
only  during  periods  of  exposure  to  the 
situation.  In  that  case,  it  would  not 
be  unreasonable,  prima  facie,  to  hope 
that  all  of  the  empirical  laws  of  learn- 
ing could  be  stated  in  terms  of  rela- 
tions between  behavioral  and  environ- 
mental variables.  Nothing  in  psychol- 
ogy is  much  more  certain,  however, 
than  that  orderly  changes  in  response 
tendencies — e.g.,  spontaneous  recovery, 
forgetting — do  occur  during  intervals 
when  the  organism  and  the  situation 
are  well  separated. 

How  are  these  "spontaneous"  changes 
to  be  accounted  for?  It  is  easy  enough 
to  construct  a  law  expressing  some  be- 
havioral measure  as  a  function  of  time, 
but  an  unfilled  temporal  interval  never 
remains  permanently  satisfying  as  an 
explanatory  variable.  The  temporal 
gap  has  to  be  filled  with  events  of  some 
sort,  observed  or  inferred,  in  the  envi- 
ronment or  in  the  organism.  The  fa- 
vorite candidate  for  the  intervening  po- 
sition has  usually  been  a  postulated 
state  or  process,  either  neural  or  purely 
hypothetical, which varies spontaneously
during rest intervals in whatever
manner is required to account for the
behavioral changes. The difficulty with
this type of construct is that it is always
much easier to postulate than to
unpostulate. Few hypothetical entities
are so ill-favored that once having secured
a foothold they cannot face out
each new turn of empirical events with
the aid of a few ad hoc assumptions.

1 This paper was prepared during the author's
tenure as a faculty research fellow of
the Social Science Research Council.

The  approach  to  time-dependent  learn- 
ing phenomena  which  will  be  illustrated 
in  this  paper  attempts  to  shift  the  bur- 
den of  explanation  from  hypothesized 
processes  in  the  organism  to  statistical 
properties  of  environmental  events.  The 
very  extensiveness  of  the  array  of  hy- 
pothetical constructs — e.g.,  set,  reactive 
inhibition,  memory  trace — which  now 
compete  for  attention  in  this  area  sug- 
gests that  postulates  of  this  type  have 
entered  the  scene  prematurely.  Until 
more  parsimonious  explanatory  vari- 
ables have  been  fully  explored,  it  will 
scarcely  be  possible  either  to  define 
clearly  the  class  of  problems  which  re- 
quire explanation  or  to  evaluate  the 
various  special  hypotheses  that  have 
been  proposed. 

By  "more  parsimonious"  sources  of 
explanation,  I  refer  to  the  variables, 
ordinarily  stimulus  variables,  which  are 
intrinsic  to  a  given  type  of  behavioral 
situation  and  thus  must  be  expected  to 
play  a  role  in  any  interpretive  schema. 


This article appeared in Psychol. Rev., 1955, 62, 145-154. Reprinted with permission.

In the present instance we are interested
specifically in the way learned
response  tendencies  change  during  rest 
intervals  following  experimental  peri- 
ods. And  we  note  that  there  are  two 
principal  ways  in  which  stimulus  vari- 
ables could  lead  to  modification  in  re- 
sponse tendencies  during  rest  intervals. 
The  first  is  the  direct  effect  that  changes 
in  the  stimulus  characteristics  of  ex- 
perimental situations  from  trial  to  trial 
or  period  to  period  may  have  upon  re- 
sponse probability.  The  second  is  the 
learning  that  may  occur  between  peri- 
ods if  the  stimulating  situations  obtain- 
ing within  and  between  periods  have 
elements  in  common.  The  former  cate- 
gory can  again  be  subdivided  according 
as  the  environmental  variation  is  sys- 
tematic or  random. 

The  random  component  has  been  se- 
lected as  our  first  subject  of  investiga- 
tion for  several  reasons.  One  is  that  it 
has  received  little  attention  heretofore 
in  learning  theory.  Another  is  that  in 
other  sciences  apparently  spontaneous 
changes  in  observables  have  frequently 
turned  out  to  be  attributable  to  random 
processes  at  a  more  molecular  level. 
Perhaps  not  surprisingly,  considerable 
analysis  has  been  needed  in  order  to 
ascertain  how  random  environmental 
fluctuations  during  intervals  of  rest 
following  learning  periods  would  be  ex- 
pected to  influence  response  probabili- 
ties. It  will  require  the  remainder  of 
this  paper  to  summarize  the  methods 
and  results  of  this  one  phase  of  the 
over-all  investigation. 

General  Theory  of  Stimulus 
Fluctuation 

Even  prior  to  a  detailed  analysis,  we 
can  anticipate  that  whenever  environ- 
mental fluctuation  occurs,  the  prob- 
ability of  a  response  at  the  end  of  one 
experimental  period  will  not  be  the 
same as the probability at the beginning
of the next. If conditioning is
carried out during a given period, some
of the newly conditioned stimulus elements²
will be replaced before the next
period  by  elements  which  have  not 
previously  been  available  for  condition- 
ing. Similarly,  during  the  interval  fol- 
lowing an  extinction  period,  random 
fluctuation  will  lead  to  the  replacement 
of  some  of  the  just  extinguished  stimu- 
lus elements  by  others  which  were  sam- 
pled during  conditioning  but  have  not 
been  available  during  extinction.  In 
either  case,  the  result  will  be  a  pro- 
gressive change  in  response  probability 
as  a  function  of  duration  of  the  rest 
interval. 

In  order  to  make  these  ideas  testable, 
we  must  state  more  formally  and  ex- 
plicitly the  concepts  and  assumptions 
involved.  Once  this  is  done,  we  will 
have  in  effect  a  fragmentary  theory,  or 
model,  which  may  account  for  certain 
apparently  spontaneous  changes  in  re- 
sponse tendencies.  At  a  minimum,  this 
formal  model  will  enable  us  to  derive 
the  logical  consequences  of  the  concept 
of  random  environmental  fluctuation  so 
that  they  may  be  tested  against  experi- 
mental data.  If  the  correspondence 
turns  out  to  be  good,  we  may  wish  to 
incorporate  this  model  into  the  concep- 
tual structure  of  S-R  learning  theory, 
viewing  it  as  a  limited  theory  which  ac- 
counts for  a  specific  class  of  time-de- 
pendent phenomena. 

Most  of  the  assumptions  we  shall  re- 
quire have  been  discussed  elsewhere 
(8)  and  need  only  be  restated  briefly 
for  our  present  purposes. 

2 For reasons of mathematical simplicity
and convenience I shall develop these ideas
in terms of the concepts of statistical learning
theory. It will be apparent, however,
that within the Hullian system a similar
argument could be worked out in terms of
the fluctuation of stimuli along generalization
continua.

a. Any environmental situation, as
constituted at a given time, determines
for a given organism a population of
stimulus events from which a sample
affects the organism's behavior at any
instant; in statistical learning theories
the population is conceptualized as a
set of stimulus elements from which a
random sample is drawn on each trial.

b.  Conditioning  and  extinction  occur 
only  with  respect  to  the  elements  sam- 
pled on  a  trial. 

c.  The  behaviors  available  to  an  or- 
ganism in  a  given  situation  may  be 
categorized  into  mutually  exclusive  and 
exhaustive  response  classes. 

d.  At  any  time,  each  stimulus  ele- 
ment in  the  population  is  conditioned 
to  exactly  one  of  these  response  classes. 

On  the  basis  of  these  assumptions, 
functions  have  been  derived  by  various 
investigators  (2,  5,  8,  16,  21)  to  de- 
scribe the  course  of  learning  predicted 
for  an  idealized  situation  in  which  the 
physical  environment  is  perfectly  con- 
stant and  the  organism  samples  the 
stimulus  population  on  each  trial.  No 
idealized  situations  are  available  for 
testing  purposes,  but  the  theory  seems 
to  give  good  approximations  to  em- 
pirical learning  functions  obtained  in 
short  experimental  periods  under  well- 
controlled  conditions. 

In  the  present  paper  we  turn  our  at- 
tention from  behavioral  changes  that 
occur  within  experimental  periods  to 
the  changes  that  occur  as  a  function  of 
the  intervals  between  periods.  Corre- 
spondingly, we  replace  the  simplifying 
assumption  of  a  perfectly  constant 
situation with the assumption of a randomly
fluctuating situation.³ Specifically,
it will be assumed that the availability
of stimulus elements during a
given learning period depends upon a
large number of independently variable
components or aspects of the environmental
situation, all of which undergo
constant random fluctuation.

3 It is possible now to go back and "correct"
the functions derived earlier to allow
for this random variation, but we will not
be able to go into this point in the present
paper.

Now  let  us  consider  the  type  of  ex- 
periment in  which  an  organism  is  run 
for  more  than  one  period  in  the  same 
apparatus.  In  dealing  with  the  behav- 
ior that  occurs  during  any  given  experi- 
mental period,  the  total  population  S* 
of  stimulus  elements  available  in  the 
situation  at  any  time  during  the  ex- 
periment can  be  partitioned  into  two 
portions: the subset S of elements
which are available during that period
and the subset S' of elements which
are not. Under the conditions considered
in this paper, the probability of a
response at any given time during the
period is equal to the proportion of elements
in the available set S that are
conditioned to that response. Owing to
environmental fluctuation, there is some
probability j that an element in the
available set S will become unavailable,
i.e., go into S', during any given interval
Δt, and a probability j' that an
element in S' will enter S. These ideas
are  illustrated  in  Fig.  1  for  a  hypotheti- 
cal situation. 

Fig. 1. Fluctuations in stimulus sets during
spontaneous regression (upper panel) and
spontaneous recovery from extinction (lower
panel). Circles represent elements connected
to response A. Values of p represent probabilities
of response A in the available set S.

The relevance of the scheme for
learning  phenomena  arises  from  the 
fact  that  both  conditioned  and  uncon- 
ditioned elements  will  constantly  be 
fluctuating  in  and  out  of  the  available 
set  S.  During  an  experimental  period 
in  which  conditioning  or  extinction  oc- 
curs, the  proportion  of  conditioned  ele- 
ments in  S  will  increase  or  decrease 
relative to the proportion in S'. But
during  a  subsequent  rest  interval,  these 
proportions  will  tend  toward  equality  as 
a  result  of  the  fluctuation  process. 

Interpretation  of  Spontaneous 
Recovery  and  Regression  * 

The  essentials  of  our  treatment  of 
spontaneous  recovery  and  regression 
will  be  clear  from  an  inspection  of 
Fig.  1.  The  upper  panel  illustrates  a 
case  in  which,  starting  from  a  zero 
level,  conditioning  of  a  given  response 
A  is  carried  out  during  one  period  until 
the  probability  of  A  in  the  available 
situation  represented  by  the  set  S  is 
unity.  At  the  end  of  the  conditioning 
period  we  will  have,  neglecting  any 
fluctuation  that  may  have  occurred 
during  the  period,  all  of  the  elements 
in S conditioned to A and all of the
temporarily unavailable elements in S'
unconditioned. During the first interval
Δt of the ensuing rest interval, the
proportion j = .6 of the conditioned
elements will escape from S, being replaced
by the proportion j' = .2 of the
unconditioned elements from S'. During
further intervals the interchange
will continue, at a progressively decreasing
rate, until the system arrives
at the final state of statistical equilibrium
in which the densities of conditioned
elements in S and S' are equal.
The predicted course of spontaneous
regression in terms of the proportion
of conditioned elements that will be in
S at any time following the conditioning
period is given by the topmost
curve in the upper panel of Fig. 2.
The equation of the curve will be derived
in a later section.

* The term spontaneous regression will be
used here to refer to any decrease in response
probability which is attributable solely to
stimulus fluctuation. It is assumed that over
short time intervals, the empirical phenomenon
of forgetting may be virtually identified
with regression, but that over longer intervals
forgetting is influenced to an increasing extent
by effects of interpolated learning.
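The interchange described for Fig. 1 can be followed numerically. The sketch below is only an illustration under stated assumptions: the set sizes (5 elements in S, 15 in S') are hypothetical values chosen so that, with j = .6 and j' = .2, the expected number of elements leaving S equals the number entering it; the function name and arguments are likewise introduced here and are not part of the original paper. Expected values are iterated rather than sampled.

```python
def expected_exchange(n_avail, n_unavail, cond_avail, cond_unavail,
                      j, j_prime, steps):
    """Iterate the expected interchange of conditioned elements.

    n_avail, n_unavail      : sizes of S and S' (assumed constant)
    cond_avail, cond_unavail: conditioned elements currently in S and S'
    j, j_prime              : per-interval probabilities of leaving / entering S
    Returns the proportion of conditioned elements in S after each interval
    and the final density of conditioned elements in S'.
    """
    proportions = [cond_avail / n_avail]
    for _ in range(steps):
        leaving = j * cond_avail            # conditioned elements escaping from S
        entering = j_prime * cond_unavail   # conditioned elements returning to S
        cond_avail += entering - leaving
        cond_unavail += leaving - entering
        proportions.append(cond_avail / n_avail)
    return proportions, cond_unavail / n_unavail


# Upper panel of Fig. 1 (regression): S fully conditioned, S' unconditioned.
regression, density_s_prime = expected_exchange(5, 15, 5.0, 0.0, 0.6, 0.2, 30)

# Lower panel of Fig. 1 (recovery): S just extinguished, S' still conditioned.
recovery, _ = expected_exchange(5, 15, 0.0, 15.0, 0.6, 0.2, 30)

print([round(p, 3) for p in regression[:4]])
print([round(p, 3) for p in recovery[:4]])
```

Under these assumed sizes the regression series falls from unity toward .25 and the recovery series rises from zero toward .75, and in each case the density of conditioned elements in S' approaches the same value as that in S, which is the equilibrium condition stated in the text.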

In  an  analogous  fashion  the  essentials 
of  the  spontaneous  recovery  process  are 
schematized  in  the  lower  panel  of  Fig. 
1.  We  begin  at  the  left  with  a  situa- 
tion following  maximal  conditioning  so 
that  all  elements  are  conditioned  to  re- 
sponse A.  During  a  single  period  of 
extinction,  all  elements  in  the  available 
set S are conditioned to the class of
competing responses Ā and the probability
of A goes temporarily to zero.
Then during a recovery interval, the
random interchange of conditioned and
unconditioned elements between S and
S' results in a gradual increase in the
proportion of conditioned elements in
S until the final equilibrium state is
reached.  The  predicted  course  of  spon- 
taneous recovery  as  a  function  of  time 
is  given  by  the  topmost  curve  in  the 
lower  panel  of  Fig.  3. 

According  to  this  analysis,  spontane- 
ous regression  and  recovery  are  to  be 
regarded  as  two  aspects  of  the  same 
process.  In  each  case  the  form  of  the 
process  is  given  by  a  negatively  ac- 
celerated curve  with  the  relative  rate  of 
change depending solely upon the characteristics
of the physical situation embodied
in the parameters j and j'. Rates
of  regression  and  recovery  should,  then, 
vary  together  whenever  the  variability 
of  the  stimulating  situation  is  modified. 

It  cannot  be  assumed,  however,  that 
amounts  of  regression  and  recovery 
should  be  equal  and  opposite  in  all  ex- 
periments. The  illustrative  example  of 
Fig.  1  meets  two  special  conditions  that 
do not always hold: (a) the conditioning
and extinction series start from
initial response probabilities of zero
and unity, respectively; and (b) conditioning
and extinction are carried to
comparable criteria within the experimental
period preceding the rest interval.

Fig. 2. Families of spontaneous regression
curves. In the upper panel the proportion of
conditioned elements in S' at the end of conditioning
is zero and the proportion in S is
the parameter. In the lower panel the proportion
of conditioned elements in S at the
end of conditioning is unity and the proportion
in S' is the parameter.

Predictions Concerning Effects
of Experimental Variables

Terminal  level  of  conditioning  or  ex- 
tinction. If  other  conditions  remain 
fixed,  the  level  of  response  probability 
attained  at  the  end  of  a  single  learning 
period  will  determine  both  the  initial 
value and the asymptote of the curve
of regression or recovery. For the
situation represented by the upper
panel of Fig. 1, the curve of conditioning
goes to unity, and the predicted
course of spontaneous regression is
given by the top curve in the upper
panel  of  Fig.  2.  If  in  the  same  situa- 
tion, conditioning  has  been  carried  only 
to  a  probability  level  of,  say,  .67,  then 
the  total  number  of  conditioned  ele- 
ments will  be  smaller  and  the  curve  of 
regression  will  not  only  start  at  a  lower 
value,  but  will  run  to  a  lower  asymp- 
tote, and  so  on.  Similarly,  if  in  the 
situation  represented  by  the  lower  panel 
of  Fig.  1,  response  probability  goes  to 
zero  during  the  extinction  period,  the 
predicted  course  of  spontaneous  re- 
covery is  given  by  the  lowest  curve  in 
the  upper  panel  of  Fig.  3;  if  extinction 
terminates  at  higher  probability  levels, 
we  obtain  the  successively  higher  re- 
covery curves  shown  in  the  figure. 

Fig. 3. Families of spontaneous recovery
curves. In the upper panel the proportion of
conditioned elements in S' at the end of extinction
is unity and the proportion of conditioned
elements in S at the end of extinction
is the parameter. In the lower panel,
the proportion of conditioned elements in S'
at the end of extinction is the parameter and
the proportion in S is zero.

Number of preceding learning periods.
Increasing the number of preceding
acquisition periods would tend to
increase the total number of conditioned
elements in S* and therefore the
asymptote of the curve of regression.
If level of response probability at the
end of the last acquisition period is
fixed at some one value, say unity, then
variation in the proportion of conditioned
elements in S' yields the family
of regression curves illustrated in the
lower panel of Fig. 2, all curves starting
at the same point but diverging to
different asymptotes. This curve family
will be recognized as corresponding to
the well-known relationship between retention
and amount of overlearning,
where overlearning is defined in terms
of additional training beyond the point
at which response probability in the
temporarily available situation reaches
unity.

Analogous  considerations  apply  in  the 
case  of  spontaneous  recovery.  Increas- 
ing the  number  of  preceding  extinction 
periods would tend to decrease the proportion
of conditioned elements remaining
in S' at the end of extinction and
thus the asymptote of the curve of
spontaneous  recovery,  as  illustrated  in 
the  lower  panel  of  Fig.  3.  On  the  other 
hand,  increasing  the  number  of  con- 
ditioning periods  prior  to  extinction 
would  tend  to  increase  the  density  of 
conditioned  elements  in  S'  and  thus  the 
asymptote  of  the  curve  of  recovery  fol- 
lowing a  period  of  extinction. 

The  experimental  phenomenon  of 
"extinction  below  zero"  corresponds  to 
a  case  in  which  additional  extinction 
trials  are  given  beyond  the  point  at 
which  temporary  response  probability 
first  reaches  zero.  The  results  of  this 
procedure  will  clearly  depend  upon  the 
conditioning  history.  Consider,  for  ex- 
ample, the  situation  illustrated  in  the 
top  row  of  Fig.  1.  If  extinction  were 
begun  immediately  following  the  condi- 
tioning period,  then  we  would  expect 
extinction below zero to have little effect,
for at the end of the first extinction
period the set S would be exhausted
of conditioned elements and
there  would  be  few  or  none  in  S'  to 
fluctuate  back  into  S  during  further  pe- 
riods of  extinction.  If,  however,  ex- 
tinction began  long  enough  after  the 
end  of  the  acquisition  period  so  that  an 
appreciable  number  of  conditioned  ele- 
ments were  in  S'  during  the  first  ex- 
tinction period,  the  additional  extinc- 
tion would  further  reduce  the  total 
number  of  conditioned  elements  and 
thus  increase  the  amount  of  training 
that  would  be  required  for  recondi- 
tioning. If  conditioning  extended  over 
more  than  one  period,  then  there  would 
be  conditioned  elements  in  S'  at  the 
end  of  conditioning,  and  similar  effects 
of  extinction  below  zero  would  be  ex- 
pected even  if  extinction  began  im- 
mediately after  the  last  conditioning 
period. 

Distribution  of  practice.  In  gen- 
eral, amount  of  spontaneous  regression 
should  vary  inversely  with  duration  of 
the  intertrial  interval  during  condition- 
ing, and  spontaneous  recovery  should 
vary  inversely  with  duration  of  the  in- 
tertrial interval  during  extinction.  In 
each  case,  the  length  of  the  intertrial 
interval  will  determine  the  extent  to 
which  the  stimulating  situation  can 
change  between  trials,  and  thus  the 
proportion of the elements in the stimulus
population S* which will be sampled
during a given number of trials.
These  relationships  will  be  treated  in 
more  detail  in  a  forthcoming  paper  (7). 

Mathematical  Development  of 
Fluctuation  Theory 

Stimulus fluctuation model. Let the
probability that any given element of
a total set S* is in the available set S
at time t be represented by f(t), the
probability that an element in S escapes
into the unavailable set S' during
a time interval Δt by j, and the probability
that an element in S' enters S
during an interval Δt by j'. Then by
elementary probability theory we have
for the probability that an element is
in S at the end of the (t + 1)st interval
Δt following an experimental period:

$$ f(t + 1) = [1 - f(t)]\,j' + f(t)(1 - j). $$

This difference equation can be solved
by standard methods (2, 12) to yield
a formula for f(t) in terms of t and
the parameters:

$$ f(t) = J - [J - f(0)]\,a^{t} \tag{1} $$

where f(0) is the initial value of f(t);
J represents the fraction j'/(j + j'); and
a represents the quantity (1 - j - j').
Since a is bounded between -1 and
+1 by the definition of j and j', the
probability that any element⁵ is in S
will settle down to the constant value J
after a sufficiently long interval of time,
and the total numbers of elements in S
and S' will stabilize at mean values N
and N', respectively, which satisfy the
relation

$$ N = J(N + N'). \tag{2} $$
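As a check on Equation 1, the difference equation can be iterated directly and compared with the closed form. The following is a minimal sketch; the function names are introduced here for illustration only, and the parameter values in the usage lines are arbitrary choices rather than values from the paper.

```python
def f_recursive(f0, j, j_prime, t):
    """Apply f(t+1) = [1 - f(t)] j' + f(t)(1 - j) for t whole intervals."""
    f = f0
    for _ in range(t):
        f = (1.0 - f) * j_prime + f * (1.0 - j)
    return f


def f_closed(f0, j, j_prime, t):
    """Closed form of Equation 1: f(t) = J - [J - f(0)] a^t."""
    J = j_prime / (j + j_prime)   # limiting probability of being in S
    a = 1.0 - j - j_prime         # per-interval decay factor
    return J - (J - f0) * a ** t


# Illustrative comparison for an element that starts in S (f(0) = 1).
for t in range(6):
    print(t, round(f_recursive(1.0, 0.6, 0.2, t), 6),
          round(f_closed(1.0, 0.6, 0.2, t), 6))
```

The two columns agree at every step, and both approach J = j'/(j + j') as t grows, which is the limiting value referred to in the text.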


Spontaneous recovery and regression. Curves of spontaneous recovery and
regression can now be obtained by appropriate application of Equation 1.

^ For simplicity, it has been assumed in this paper that all of the
elements in S* have the same values of j and j'. In dealing with some
situations it might be more reasonable to assume that different parameter
values are associated with different elements. For example, data obtained
by Homme (11) suggest that in the Skinner box a portion of the elements
should be regarded as fixed and always available while the remainder
fluctuate. Application of an analytic method described elsewhere (8) shows
that conclusions in the general case will differ only quantitatively from
those given in this paper.


Let us designate by p(t) and p'(t) the proportions of conditioned elements,
and therefore the response probabilities, in S and S' respectively at time
t following an experimental period. The set of conditioned elements in S at
time t will come in part from the conditioned elements, p(0)N in number,
that were in S at the end of the experimental period, and in part from the
conditioned elements, p'(0)N' in number, that were in S'. The probabilities
of finding elements from these two sources in S at time t are obtained from
Equation 1 by setting f(0) equal to 1 and 0 respectively. With these
relations at hand we are ready to write the general expression for
spontaneous recovery and regression:

$$\begin{aligned}
p(t) &= \frac{1}{N}\Big[\,p(0)\{J - (J - 1)a^{t}\}N + p'(0)\,J(1 - a^{t})N'\Big] \\
     &= p(0)[J - (J - 1)a^{t}] + p'(0)(1 - a^{t})(1 - J),
\end{aligned} \tag{3}$$

the parameters N and N' having been eliminated by means of Equation 2.

The functions illustrated by the curve families of Fig. 2 and 3 are all
special cases of Equation 3. In the upper panel of Fig. 2, p'(0) has been
set equal to 0; in the lower panel, p(0) has been set equal to 1. In the
upper panel of Fig. 3, p'(0) has been set equal to 1; in the lower panel,
p(0) has been set equal to 0.
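As a rough illustration of how Equation 3 generates both kinds of curve, the following sketch evaluates p(t) for two limiting cases. The function name and the parameter values (j = 0.2, j' = 0.1) are assumptions made for the example; they are not the values used to draw the figures.

```python
def response_probability_after_rest(t, p0, p0_prime, j, j_prime):
    """Equation 3: p(t) = p(0)[J - (J - 1)a^t] + p'(0)(1 - a^t)(1 - J).

    p0       -- proportion of conditioned elements in S at the end of the session
    p0_prime -- proportion of conditioned elements in S' at the end of the session
    """
    J = j_prime / (j + j_prime)
    a = 1.0 - j - j_prime
    return p0 * (J - (J - 1.0) * a ** t) + p0_prime * (1.0 - a ** t) * (1.0 - J)


if __name__ == "__main__":
    j, j_prime = 0.2, 0.1  # assumed values, for illustration only
    for t in (0, 1, 2, 5, 10, 50):
        # Recovery: nothing conditioned in S, everything conditioned in S'.
        recovery = response_probability_after_rest(t, 0.0, 1.0, j, j_prime)
        # Regression: everything conditioned in S, nothing conditioned in S'.
        regression = response_probability_after_rest(t, 1.0, 0.0, j, j_prime)
        print(t, round(recovery, 4), round(regression, 4))
```

Under these assumptions the first column rises from 0 toward 1 - J and the second falls from 1 toward J, which is the exponential form the text attributes to recovery and regression.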

Empirical  Relevance  and  Adequacy 

General  considerations.  The  theo- 
retical developments  of  the  preceding 
sections  present  two  aspects,  one  gen- 
eral and  one  specific,  which  are  by  no 
means  on  the  same  footing  with  regard 
to  testability.  It  will  be  necessary  to 
discuss  separately  the  general  concept 
of  stimulus  fluctuation  and  the  specific 
mathematical model utilized for purposes
of deriving its testable consequences.

The  reason  why  the  fluctuation  con- 
cept had  to  be  incorporated  into  a  for- 
mal theory  in  order  to  be  tested  was, 
of  course,  the  difficulty  of  direct  ob- 
servational check.  Thus  for  the  pres- 
ent this  concept  must  be  treated  with 
the  same  reserve  and  even  suspicion 
as  any  interpretation  which  appeals  to 
unobservable  events.  This  remoteness 
from  direct  observation  may,  however, 
represent  only  a  transitory  stage  in  the 
development  of  the  theory.  Relatively 
direct  attacks  upon  certain  aspects  of 
the  stimulus  element  concept  are  pro- 
vided by  recent  experiments  (1,  21)  in 
which  the  sampling  of  stimulus  popula- 
tions has  been  modified  experimentally 
and  the  outcome  compared  with  theo- 
retical expectation.  Further,  it  should 
be  noted  that  the  idea  of  stimulus 
fluctuation  is  well  grounded  in  physi- 
cal considerations.  Surely  no  one  would 
deny  that  stimulus  fluctuation  must  oc- 
cur continuously;  the  only  question  is 
whether  fluctuations  are  large  enough 
under  ordinary  experimental  conditions 
to  yield  detectable  effects  upon  behav- 
ior. The  surmise  that  they  are  is  not 
a  new  one;  the  idea  of  fluctuating  en- 
vironmental components  has  been  used 
in  an  explanatory  sense  by  a  number 
of  investigators  in  connection  with  par- 
ticular problems:  e.g.,  by  Pavlov  (19) 
and  Skinner  (22)  in  accounting  for 
perturbations  in  curves  of  conditioning 
or  extinction,  by  Guthrie  (10)  in  ac- 
counting for  the  effects  of  repetition, 
and  recently  by  Saltz  (20)  in  account- 
ing for  disinhibition  and  reminiscence. 

Considered  in  isolation,  the  concept 
of  stimulus  fluctuation  is  not  even  in- 
directly testable;  it  must  be  incorpo- 
rated into  some  broader  body  of  theory 
before  empirical  consequences  can  be 
derived.  In  the  present  paper  we  have 
found  that  when  this  concept  is  taken 
in  conjunction  with  other  concepts  and 


assumptions  common  to  contemporary 
statistical  learning  theories  (2,  5,  8, 
16),  the  result  of  the  union  is  a  mathe- 
matical model  which  yields  a  large  num- 
ber of  predictions  concerning  changes 
in  response  probability  during  rest  in- 
tervals. Once  formulated,  this  model 
is  readily  subject  to  experimental  test. 
Its  adequacy  as  a  descriptive  theory  of 
spontaneous  recovery  and  regression  can 
be  evaluated  quite  independently  of  the 
merits  of  the  underlying  idea  of  stimu- 
lus fluctuation. 

Spontaneous  recovery.  Space  does 
not  permit  the  detailed  discussion  of 
experimental  studies,  and  we  shall  have 
to  limit  ourselves  to  a  brief  summary  of 
empirical  relationships  derivable  from 
the  theory,  together  with  appropriate 
references  to  the  experimental  litera- 
ture. To  the  best  of  my  knowledge, 
the  references  cited  include  all  studies 
which  provide  quantitative  data  suit- 
able for  comparison  with  predicted 
functions. 

a.  The  curve  of  recovery  is  exponential  in 
form  (3,  9,  17)  with  the  slope  independent 
of  the  initial  value   (3). 

b.  The  asymptote  of  recovery  is  inversely 
related  to  the  degree  of  extinction  (3,  11). 

c.  The  asymptote  of  recovery  is  directly 
related  to  the  number  of  conditioning  peri- 
ods given  prior  to  extinction   (11). 

d.  The  asymptote  of  recovery  is  directly 
related  to  the  spacing  of  preceding  condition- 
ing periods  (11). 

e.  Amount  of  recovery  progressively  de- 
creases during  a  series  of  successive  extinc- 
tion periods  (4;  13;  19,  p.  61). 

It  may  be  noted  that  items  c  and  d 
represent  empirical  findings  growing  out 
of  a  study  conducted  expressly  to  test 
certain  aspects  of  the  theory.  Many 
additional  predictions  derivable  from 
the  theory  must  remain  unevaluated 
until  appropriate  experimental  evidence 
becomes available, e.g., the inverse relation
between asymptote of recovery
and spacing of extinction trials or periods,
and the predictions concerning "extinction
below zero" mentioned in a
previous section.

Spontaneous  regression.  Predictions 
concerning  functional  relationships  be- 
tween spontaneous  regression  and  such 
experimental  variables  as  trial  spacing 
or  degree  of  learning  parallel  those 
given  above  for  spontaneous  recovery, 
but  in  the  case  of  regression  there  are 
fewer  data  available  for  purposes  of 
verification.  The  predicted  exponential 
decrease  in  amount  of  regression  as  a 
function  of  number  of  preceding  learn- 
ing periods  has  been  observed  in  sev- 
eral studies  (6,  11,  13,  14).  Predic- 
tions concerning  regression  in  relation 
to  spacing  of  learning  periods  have  not 
been  tested  in  conditioning  situations, 
but  they  seem  to  be  in  agreement  with 
rather  widely  established  empirical  re- 
lationships between  spacing  and  reten- 
tion in  human  learning  (15,  pp.  156- 
158;  18,  p.  508). 

Finally,  the  question  may  be  raised 
whether  there  are  no  experimental  facts 
that  would  embarrass  the  present  the- 
ory. If  a  claim  of  comprehensiveness 
had  been  made  for  the  theory,  then 
negative  instances  would  be  abundantly 
available.  Under  some  conditions,  for 
example,  recovery  or  regression  fails  to 
appear  at  all  following  extinction  or 
conditioning,  respectively.  Since,  how- 
ever, we  are  dealing  with  a  theory  that 
is  limited  to  effects  of  a  single  inde- 
pendent variable,  stimulus  fluctuation, 
instances  of  that  sort  are  of  no  special 
significance.  Like  any  limited  theory, 
this  one  can  be  tested  only  in  situa- 
tions where  suitable  measures  are  taken 
and  where  the  effects  of  variables  not 
represented  in  the  model  are  either 
negligible  or  else  quantitatively  pre- 
dictable. And  subject  to  these  qualifi- 
cations, available  evidence  seems  to  be 
uniformly  confirmatory.  The  danger 
of  continually  evading  negative  evi- 
dence by  ad  hoc  appeals  to  other  vari- 
ables cannot  be  entirely  obviated,  but 


it  may  be  progressively  reduced  if  we 
are  successful  in  bringing  other  relevant 
independent  variables  into  the  theoreti- 
cal fold  by  further  applications  of  the 
analytical  method  illustrated  here. 

Summary 

In  this  paper  we  have  investigated 
the  possibility  that  certain  apparently 
spontaneous  behavioral  changes,  e.g., 
recovery  from  extinction,  may  be  ac- 
counted for  in  terms  of  random  fluctua- 
tion in  stimulus  conditions.  Taken  in 
isolation,  the  concept  of  random  stimu- 
lus fluctuation  has  proved  untestable, 
but  when  incorporated  into  a  model  it 
has  led  to  quantitative  descriptions  of 
a  variety  of  already  established  em- 
pirical relationships  concerning  spon- 
taneous recovery  and  regression  and  to 
the  determination  of  some  new  ones. 
A  forthcoming  paper  in  which  the  same 
model  is  applied  to  the  problem  of  dis- 
tribution of  practice  will  provide  fur- 
ther evaluation  of  its  scope  and  useful- 
ness in  the  interpretation  of  learning 
phenomena.

A THEORY OF STIMULUS VARIABILITY IN LEARNING


W.  K.  ESTES  AND  C.  J.  BURKE 

Indiana  University 


There  are  a  number  of  aspects  of  the 
stimulating  situation  in  learning  experi- 
ments that  are  recognized  as  important 
by  theorists  of  otherwise  diverse  view- 
points but  which  require  explicit  rep- 
resentation in  a  formal  model  for  ef- 
fective utilization.  One  may  find,  for 
example,  in  the  writings  of  Skinner, 
Hull,  and  Guthrie  clear  recognition  of 
the  statistical  character  of  the  stimulus 
concept.  All  conceive  a  stimulating 
situation  as  made  up  of  many  compo- 
nents which  vary  more  or  less  inde- 
pendently. From  this  locus  of  agree- 
ment, strategies  diverge.  Skinner  (17) 
incorporates  the  notion  of  variability 
into  his  stimulus-class  concept,  but 
makes  little  use  of  it  in  treating  data. 
Hull  states  the  concept  of  multiple  com- 
ponents explicitly  (13)  but  proceeds  to 
write  postulates  concerning  the  condi- 
tions of  learning  in  terms  of  single 
components,  leaving  a  gap  between  the 
formal  theory  and  experimentally  de- 
fined variables.  Guthrie  (11)  gives 
verbal  interpretations  of  various  phe- 
nomena, e.g.,  effects  of  repetition,  in 
terms  of  stimulus  variability;  these  in- 
terpretations generally  appear  plausi- 
ble but  they  have  not  gained  wide  ac- 
ceptance among  investigators  of  learn- 
ing, possibly  because  Guthrie's  assump- 
tions have  not  been  formalized  in  a 
way  that  would  make  them  easily  used 

1 This paper is based upon a paper reported by the writers at the Boston
meetings of the Institute of Mathematical Statistics in December 1951. The
writers' thinking along these and related lines has been stimulated and
their research has been facilitated by participation in an interuniversity
seminar in mathematical models for behavior theory which met at Tufts
College during the summer of 1951 and was sponsored by SSRC.


by  others.  Statistical  theories  of  learn- 
ing differ  from  Hull  in  making  stimulus 
variability  a  central  concept  to  be  used 
for  explanatory  purposes  rather  than 
treating  it  as  a  source  of  error,  and  they 
go  beyond  Skinner  and  Guthrie  in  at- 
tempting to  construct  a  formalism  that 
will  permit  unambiguous  statements  of 
assumptions  about  stimulus  variables 
and  rigorous  derivation  of  the  con- 
sequences of  these  assumptions. 

It  has  been  shown  in  a  previous 
paper  (7)  that  several  quantitative  as- 
pects of  learning,  for  example  the  ex- 
ponential curve  of  habit  growth  regu- 
larly obtained  in  certain  conditioning 
experiments,  follow  as  consequences  of 
statistical  assumptions  and  need  not  be 
accounted  for  by  independent  postu- 
lates. All  of  the  derivations  were  car- 
ried out,  however,  under  the  simplifying 
assumption  that  all  components  of  a 
stimulating  situation  are  equally  likely 
to  occur  on  any  trial.  By  removing 
that  restriction,  we  are  now  in  a  posi- 
tion to  generalize  and  extend  the  theory 
in  several  respects.  It  will  be  possible 
to  show  that  regardless  of  whether  as- 
sumptions as  to  the  necessary  condi- 
tions for  learning  are  drawn  from  con- 
tiguity theories  or  from  reinforcement 
theories,  certain  characteristics  of  the 
learning  process  are  invariant  with  re- 
spect to  stimulus  properties  while  other 
characteristics  depend  in  specific  ways 
upon  the  nature  of  the  stimulating 
situation. 

The  Generalized  Set  Model: 
Assumptions  and  Notation 


This article appeared in Psychol. Rev., 1953, 60, 276-286. Reprinted with permission.

The exposure of an organism to a stimulating situation determines a set of
events referred to collectively as stimulation. These events constitute the
data of the various special disciplines concerned with vision, audition,
etc. We wish to formulate our model of the stimulus situation so that
information from these special disciplines can be fed into the theory,
although utilization of that information will depend upon the demands of
learning experiments.

For  the  present  we  shall  make  only 
the  following  very  general  assumptions 
about  the  stimulating  situation:  (a) 
The  effect  of  a  stimulus  situation  upon 
an  organism  may  be  regarded  as  made 
up  of  many  component  events,  (b) 
When  a  situation  is  repeated  on  a  series 
of  trials,  any  one  of  these  component 
stimulus  events  may  occur  on  some 
trials  and  fail  to  occur  on  others;  as  a 
first  approximation,  at  least,  the  rela- 
tive frequencies  of  the  various  stimulus 
events  when  the  same  situation  (as  de- 
fined experimentally)  occurs  on  a  series 
of  trials,  may  be  represented  by  inde- 
pendent probabilities.  We  formulate 
these  assumptions  conceptually  as  fol- 
lows: 

(a) With any given organism we associate a set S* of N* elements.² The N*
elements of S* are to represent all of the stimulus events that can occur
in that organism in any situation whatever, with each of these possible
events corresponding to an element of the set. (b) For any reproducible
stimulating situation we assume a distribution of values of the parameter
θ; we represent by θ_i the probability that the stimulus event
corresponding to the ith element of S* occurs on any given trial.

2 In the sequel, various sets will be designated by the letter S,
accompanied by appropriate subscripts and superscripts. The letter N, with
the same arrangement of subscripts and superscripts, always denotes the
size of the set.


It  is  assumed  that  any  change  in  the 
situation  (and  we  shall  attempt  to  deal 
only  with  controlled  changes  correspond- 
ing to  manipulations  of  experimental 
variables) determines a new distribution
of values of the θ_i. By repeating
the  "same"  situation,  we  mean  the  same 
as  described  in  physical  terms,  and  we 
recognize  that,  strictly  speaking,  repeti- 
tion of  the  same  situation  refers  to  an 
idealized  state  of  affairs  which  can  be 
approached  by  increasing  experimental 
control  but  possibly  never  completely 
realized. 

It  is  recognized  that  some  sources  of 
stimulation  are  internal  to  the  organ- 
ism. This  means  that  in  order  to  have 
a  reproducible  situation  in  a  learning 
experiment  it  is  necessary  to  control 
the  maintenance  schedule  of  the  or- 
ganism and  also  activities  immediately 
preceding  the  trial.  In  the  present 
paper  we  shall  not  use  the  term  "trial" 
in a sufficiently extended sense to necessitate
including in the θ distribution
movement-produced-stimulation  arising 
from  the  responses  occurring  on  the 
trial. 

We have noted that the behavior on a given trial is assumed to be a
function of the stimulus elements which are sampled on that trial. If in a
given situation certain elements of S* have a probability θ = 0 of being
sampled, those elements have a negligible effect upon the behavior in that
situation. For this reason we often represent a specific situation by means
of a reduced set S. An element of S* is in S if and only if it has a
non-zero value of θ in the given situation. These sets are represented in
Fig. 1. In this connection, we must note that a probability of zero for a
given event does not mean that the event can never occur "accidentally";
this probability has the weaker meaning that the relative frequency of
occurrence of the event is zero in the long run. For a more detailed
explication of this point the reader is referred to Cramer (5).

Fig. 1. A schematic representation of stimulus elements, the stimulus space
S*, the reduced set S containing elements with non-zero θ values for a
given stimulating situation, and the response classes A and Ā. The arrows
joining elements of S to the response classes represent the partition of S
into S_A and S_Ā.

It should be clearly understood that the probability, θ, that a given
stimulus event occurs on a trial may depend upon many different
environmental events. For example, a stimulus event associated with visual
stimulation may depend for its probability upon several different light
sources in the environment. Suppose that for a given stimulus element, the
associated probability θ in a given situation depends only upon two
separately manipulable components of the environment, a and b, and that the
probabilities of the element's being drawn if only a or b alone were
present are θ_a and θ_b, respectively. Then the probability attached to
this element in the situation with both components present will be

$$\theta = \theta_a + \theta_b - \theta_a\theta_b.$$
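The combination rule above is the ordinary probability that at least one of two events occurs, which takes this form when the two components act independently (an assumption the formula itself implies). A minimal sketch, with a hypothetical function name and made-up values:

```python
def combined_theta(theta_a, theta_b):
    """Probability that the element is drawn when components a and b are both present.

    Assumes the two components act independently, so that
    theta = theta_a + theta_b - theta_a * theta_b.
    """
    return theta_a + theta_b - theta_a * theta_b


if __name__ == "__main__":
    # Illustrative (assumed) values for the two components.
    theta_a, theta_b = 0.2, 0.5
    theta = combined_theta(theta_a, theta_b)
    # Equivalent form: one minus the probability that neither component draws the element.
    alternative = 1.0 - (1.0 - theta_a) * (1.0 - theta_b)
    print(theta, alternative)
```

The second expression makes the reasoning explicit: the element fails to occur only if it is missed under both components.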


The  Response  Model 

The  response  model  formulated  in  a 
previous  paper  (7)  will  be  used  here 
without  any  important  modification. 
We  shall  deal  only  with  the  simple  case 
of two mutually exclusive and exhaustive
response classes. The response
class being recorded in a given situation
will be designated A and the complementary
class, Ā. The dependent
variable  of  the  theory  is  the  probability 
that  the  response  occurring  on  a  given 
trial  is  a  member  of  class  A.  It  is  rec- 
ognized that  in  a  learning  experiment 
the  behaviors  available  to  the  organism 
may  be  classified  in  many  different 
ways,  depending  upon  the  interests  of 
the  experimenter.  The  response  class 
selected  for  investigation  may  be  any- 
thing from  the  simplest  reflex  to  a  com- 
plex chain  of  behaviors  involving  many 
different groups of effectors. Adequate
treatment of all levels of response
specification  would  require  the  formula- 
tion of  a  model  for  the  response  space 
and  will  not  be  attempted  in  the  pres- 
ent paper.  Preliminary  investigation 
of  this  problem  leads  us  to  believe  that 
when  a  response  model  is  elaborated, 
the  theory  developed  in  this  paper  will 
be  found  to  hold  rigorously  for  the  most 
elementary  response  components  and  to 
a  first  approximation  for  simple  re- 
sponse classes  that  do  not  involve 
chaining.  For  experimental  verifica- 
tion of  the  present  theory  we  shall 
look  to  experiments  involving  response 
classes  no  more  complex  than  flexing  a 
limb,  depressing  a  bar,  or  moving  a  key. 

Conditional  Relations  and 
Response  Probability 

We  assume  that  the  behavior  of  an 
organism  on  any  trial  is  a  function,  not 
of  the  entire  population  of  possible 
stimulus  events,  but  only  of  those 
stimulus  events  which  occur  on  that 
trial;  further,  when  learning  takes 
place,  it  involves  a  change  in  the  de- 
pendency of  the  response  upon  the 
stimulus  events  which  have  occurred  on 
the  given  trial. 

Conditional relations, or for brevity, connections, between response
classes and stimulus elements are defined as in other papers on statistical
learning theory (3, 7). The response classes A and Ā define a partition of
S* into two subsets S_A* and S_Ā*. Elements in S_A* are said to be
"connected to" or "conditioned to" response A; those in S_Ā* to response Ā.
The concept of a partition implies specifically that every element of S*
must be connected either to A or to Ā but that no element may be connected
to both simultaneously.³ Various features of the model are illustrated in
Fig. 1.

3 The argument of this section could as well be given in terms of the set S
as of S*, defining S_A and S_Ā as the partition of S imposed by the
response classes A and Ā.

For each element in S* we define a quantity F_i(n) representing the
probability that the element in question is connected to response A, i.e.,
is in the subset S_A*, at the end of trial n. The mean value of F_i(n) over
S* is, then, simply the expected proportion of elements connected to A, and
if all of the θ_i were equal, it would be natural to define this proportion
as the probability, p(n), that response A occurs on trial n + 1. In the
general case, however, not all of the θ_i are equal and the contribution of
each element should be weighted by its probability of occurrence, giving

$$p(n) = \frac{\sum_i \theta_i F_i(n)}{\sum_i \theta_i}. \tag{1}$$

It will be seen that in the equal θ case, expression (1) reduces to

$$p(n) = \frac{1}{N^{*}}\sum_i F_i(n) = E(F_i(n)) \tag{2}$$

which, except for changes in notation, is the definition used in previous
papers (6, 7).

The quantity p is, then, another of the principal constructs of the theory.
It is referred to as a probability, firstly because we assume explicitly
that quantities p are to be manipulated mathematically in accordance with
the axioms of probability theory, and secondly because in some situations p
can be given a frequency interpretation. In any situation where a sequence
of responses can be obtained under conditions of negligible learning and
independent trials (as at the asymptote of a simple learning experiment
carried out with discrete, well-spaced trials) the numerical value of p is
taken as the average relative frequency of response A. For all situations
the construct p is assumed to correspond to a parameter of the behavior
system, and we do not cease to
speak  of  this  as  a  probability  in  the 
case  of  a  situation  where  it  cannot  be 
evaluated  as  a  relative  frequency.  It 
has  been  shown  in  a  previous  paper  (7) 
that  p  can  be  related  in  a  simple  man- 
ner to  rate  or  latency  of  responding  in 
many  situations;  thus  in  all  applica- 
tions of  the  theory,  p  is  evaluated  in 
accordance  with  the  rules  prescribed  by 
the  theory,  either  from  frequency  data 
or  from  other  appropriate  data,  and 
once  evaluated  is  treated  for  all  mathe- 
matical purposes  as  a  probability. 
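The definition of p in expressions (1) and (2) can be made concrete with a small computation. This is only a sketch: the function name is hypothetical, the numbers are invented for illustration, and the weighting is written as a θ-weighted mean of the F_i(n), which is how the verbal description in the text reads.

```python
def weighted_response_probability(thetas, connection_probs):
    """Theta-weighted mean of the F_i(n): sum(theta_i * F_i) / sum(theta_i).

    thetas           -- sampling probabilities theta_i of the elements
    connection_probs -- F_i(n), probability that element i is connected to A
    """
    if len(thetas) != len(connection_probs):
        raise ValueError("thetas and connection_probs must have the same length")
    total = sum(thetas)
    if total <= 0:
        raise ValueError("at least one element must have a non-zero theta")
    return sum(th * f for th, f in zip(thetas, connection_probs)) / total


if __name__ == "__main__":
    # Equal thetas: the weighted mean is just the plain mean of the F_i, as in (2).
    print(weighted_response_probability([0.2, 0.2, 0.2], [1.0, 0.0, 1.0]))
    # Unequal thetas: an element that is sampled more often counts for more.
    print(weighted_response_probability([0.1, 0.3], [1.0, 0.0]))
    print(weighted_response_probability([0.1, 0.3], [0.0, 1.0]))
```

In the last two calls the same single element is conditioned, yet the response probability differs because the conditioned element is the rarely sampled one in the first case and the frequently sampled one in the second.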

Representation  of  Learning 
Processes 

In  order  to  account  for  the  gradual 
course  of  learning  in  most  situations, 
a  number  of  the  earlier  quantitative 
theories, e.g., those of Hull (13), Gulliksen
and Wolfle (10), Thurstone
(18), have assumed that individual connections
are formed gradually over a
series  of  learning  trials.  Once  we  adopt 
a  statistical  view  of  the  stimulating 
situation,  however,  it  can  be  shown 
rigorously  that  not  only  the  gradual 
course  of  learning  but  the  form  of  the 
typical  learning  curve  can  be  accounted 
for  in  terms  of  probability  considera- 
tions even  if  we  assume  that  connec- 
tions are  formed  on  an  all-or-none 
basis.  This  being  the  case,  there  seems 
to  be  no  evidence  whatsoever  that 
would  require  a  postulate  of  gradual 
formation  of  individual  connections. 
Psychologically  an  all-or-none  assump- 
tion has  the  advantage  of  enabling  us 
to  account  readily  for  the  fact  that 
learning  is  sudden  in  some  situations 
and  gradual  in  others;  mathematically, 
it  has  the  advantage  of  great  simplicity. 
For  these  reasons,  recent  statistical 
theories  of  learning  have  adopted  some 
form  of  the  all-or-none  assumption  (3, 
7,  15). 

Under  an  all-or-none  theory,  we  must 


specify  the  probabilities  that  any  stimu- 
lus element  that  is  sampled  on  a  given 
trial will become connected to A or to
Ā. For convenience in exposition, we
shall  limit  ourselves  in  this  paper  to 
the  simplest  special  case,  i.e.,  a  homoge- 
neous series  of  discrete  trials  with 
probability  equal  to  one  that  all  ele- 
ments occurring  on  a  trial  become  con- 
nected to  response  A. 

We  begin  by  asking  what  can  be  said 
about  the  course  of  learning  during  a 
sequence  of  trials  regardless  of  the  dis- 
tribution of  stimulus  events.  It  will 
be  shown  that  our  general  assumptions 
define  a  family  of  mathematical  opera- 
tors describing  learning  during  any  pre- 
scribed sequence  of  trials,  the  member 
of the family applicable in a given situation
depending upon the θ distribution.
We  shall  first  inquire  into  the  charac- 
teristics common  to  all  members  of  a 
family,  and  then  into  the  conditions 
under  which  the  operators  can  be  ap- 
proximated adequately  by  the  relatively 
simple  functions  that  have  been  found 
convenient  for  representing  learning 
data  in  previous  work. 

Let us consider the course of learning during a sequence of trials in the
simplified situation. Each trial in the series is to begin with the
presentation of a certain stimulus complex. This situation defines a
distribution of θ over S* so that each element in S* has some probability,
θ_i, of occurring on any trial, and we represent by S the subset of
elements with non-zero θ values; any element that occurs on a trial becomes
connected to A (or remains connected to A if it has been drawn on a
previous trial). For concreteness the reader might think of a simple
conditioning experiment with the CS preceding the US by an optimal
interval, and with conditions arranged so that the UR is evoked on each
trial and decremental factors are negligible; the situation represented by
S is that obtaining from the onset of the CS to the onset of the US, and
the response probability p will refer to the probability of A in this
situation. The number of elements in S will be designated by N. For
simplicity we shall suppose in the following derivations that none of the
elements in S are connected to A at the beginning of the experiment. This
means that the learning curves obtained all begin with N_A and p equal to
zero. No loss of generality is involved in this simplification; our results
may easily be extended to the case of any arbitrary initial condition.

The ith element in S will still remain in S_Ā after the nth trial if and
only if it is not sampled on any of the first n trials; the likelihood that
this occurs is (1 - θ_i)^n. Hence, if F_i(n) represents the expected
probability that this element is connected to A after the nth trial, we
obtain:

$$F_i(n) = 1 - (1 - \theta_i)^{n}. \tag{3}$$

The expected number of elements in S connected to A after the nth trial,
E[N_A(n)], will be the sum of these expected contributions from individual
elements:

$$\begin{aligned}
E[N_A(n)] &= \sum_i F_i(n) \\
          &= \sum_i \left[1 - (1 - \theta_i)^{n}\right] \\
          &= N - \sum_i (1 - \theta_i)^{n}.
\end{aligned} \tag{4}$$
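Equation (4) can be checked against a direct simulation of the all-or-none sampling process described above. The sketch below is illustrative only: the function names, the seed, the number of simulated experiments, and the choice of twenty elements (ten with θ = 0.1 and ten with θ = 0.3) are assumptions made for the example.

```python
import random


def expected_connected(thetas, n):
    """Equation (4): E[N_A(n)] = N - sum_i (1 - theta_i)^n."""
    return len(thetas) - sum((1.0 - th) ** n for th in thetas)


def simulate_connected(thetas, n, runs, seed=0):
    """Average number of elements connected to A after n trials.

    On each trial every element is sampled independently with its own
    probability theta_i; a sampled element becomes (and stays) connected to A.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        connected = [False] * len(thetas)
        for _ in range(n):
            for i, th in enumerate(thetas):
                if rng.random() < th:
                    connected[i] = True
        total += sum(connected)
    return total / runs


if __name__ == "__main__":
    thetas = [0.1] * 10 + [0.3] * 10  # assumed population, for illustration
    n = 5
    print("predicted:", expected_connected(thetas, n))
    print("simulated:", simulate_connected(thetas, n, runs=4000))
```

The simulated mean should fall close to the predicted value, with the gap shrinking as the number of simulated experiments grows.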

We are now in a position to express p, the probability of response A, as a
function of the number of trials in this situation. By substituting for the
term F_i(n) of equation (1) its equivalent from equation (3), we obtain the
relation

$$p(n) = \frac{\sum_i \theta_i \left[1 - (1 - \theta_i)^{n}\right]}{\sum_i \theta_i}. \tag{5}$$


Equation (5) defines a family of learning curves, one for each possible θ
distribution, and it has a number of simple properties that are independent
of the distribution of the θ_i. It can easily be verified by substitution
that there is a fixed point at p = 1, and this will be the asymptote
approached by the curve of p(n) vs. n as n increases over all bounds.
Members of the family will be monotonically increasing, negatively
accelerated curves, approaching a simple negative growth function as the
θ_i tend toward equality. If all of the θ_i are equal to θ, equation (5)
reduces to

$$p(n) = 1 - (1 - \theta)^{n} \tag{6}$$

which, except for a change in notation, is the same function derived
previously (7) for the equal θ case⁴ and corresponds to the linear operator
used by Bush and Mosteller (2) for situations where no decremental factor
is involved. In mathematical form, equation (6) is the same as Hull's
well-known expression for growth of habit strength, but the function does
not have the same relation to observed probability of responding in Hull's
theory as in the present formulation.
Except  where  the  distribution  func- 
tion of  the  9i  either  is  known,  or  can 
be  assumed  on  theoretical  grounds  to 
be  approximated  by  some  simple  ex- 
pression, equation  (5)  will  not  be  con- 
venient to  work  with.  In  practice  we 
are  apt  to  assume  equal  di  and  utilize 
equation  (6)  to  describe  experimental 
data.  The  nature  of  the  error  of  ap- 
proximation involved  in  doing  this  can 
be  stated  generally.  Immediately  after 
the  first  trial,  the  curve  for  the  general 
case  must  lie  above  the  curve  for  the 

4  This  is  essentially  the  same  function  de- 
veloped for  the  equal  $  case  in  a  previous 
paper  (7) ;  the  terms  9  and  n  of  equation  (6) 
correspond  to  the  terms  q  =  s/S,  and  T  of 
that  paper. 
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Fig.  2.  Response  probability,  in  S,  as  a  function  of  number  of  trials  for  the  numerical  ex- 
amples presented  in  the  text.  The  solid  curve  is  the  exact  solution  for  a  population  of  elements, 
half  of  which  have  ^  =  0.1  and  half  ^  =  0.3.  The  dashed  curve  describes  the  equal  $  approxima- 
tion with  ^  =  0.2.    Initially  no  elements  of  S  are  conditioned  to  A. 


equal  6  case;  the  difference  between  the 
two  curves  increases  for  a  few  trials, 
then  decreases  until  they  cross  (in  con- 
structing hypothetical  6  distributions  of 
diverse  forms  we  have  usually  found 
this  crossing  in  the  neighborhood  of 
the  fourth  to  eighth  trial) ;  after  cross- 
ing, the  curves  diverge  to  a  smaller  ex- 
tent than  before,  then  come  together  as 
both  go  to  the  same  as3niiptote  at 
p  =  I.  It  can  be  proved  that  the 
curves  for  the  general  and  special  case 
cross  exactly  once  as  n  goes  from  one 
to  infinity.  We  cannot  make  any  gen- 
eral statement  about  the  maximum  er- 
ror involved  in  approximating  expres- 
sion (5)  with  expression  (6),  but  after 
studying  a  number  of  special  cases,  we 
are  inclined  to  believe  that  the  error 
introduced  by  the  approximation  will 
be  too  small  to  be  readily  detectable 
experimentally  for  most  simple  learn- 
ing situations  that  do  not  involve  com- 
pounding of  stimuli. 

The development of equations (5) and (6) has necessarily been given in rather general terms, and it may be helpful to illustrate some of the considerations involved by means of a simple numerical example. Imagine that we are dealing with some particular conditioning experiment in which the CS can be represented by a set S, composed of two subsets of stimulus elements, $S_1$ and $S_2$, of the sizes $N_1 = N_2 = N/2$, where N is the number of elements in S. Assume that for all elements in $S_1$ the probability of being drawn on any trial is $\theta_1 = 0.3$ and for those in $S_2$, $\theta_2 = 0.1$. Now we wish to compute the predicted learning curve during a series of trials on which A responses are reinforced, assuming that we begin with all elements connected to $\bar{A}$. Equation (5) becomes

$p(n) = 1 - \frac{1}{N\bar{\theta}} \left[N_1(0.3)(1 - 0.3)^n + N_2(0.1)(1 - 0.1)^n\right] = 1 - \frac{1}{0.4}\left[0.3(0.7)^n + 0.1(0.9)^n\right].$

Plotting  numerical  values  computed 
from  this  equation,  we  obtain  the  solid 
curve  given  in  Fig.  2. 

Now let us approach the same problem, but supposing this time that we know nothing about the different $\theta$ values in the subsets $S_1$ and $S_2$ and are given only that $\bar{\theta} = 0.2$. We now obtain predicted learning curves under the equal $\theta$ approximation. Equation (6) becomes:

$p(n) = 1 - (1 - 0.2)^n$

and  numerical  values  computed  from 
this  yield  the  dashed  curve  of  Fig.  2. 
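The two curves of the numerical example can be tabulated directly. The following is a minimal sketch, assuming the form of equation (5) with no elements initially connected to A; the function names are illustrative, and a single representative element stands in for each of the two equal-sized subsets.

```python
def p_exact(n, thetas):
    """Equation (5), all elements initially unconnected to A."""
    total = sum(thetas)  # equals N times the mean theta
    return 1.0 - sum(t * (1.0 - t) ** n for t in thetas) / total

def p_equal(n, theta):
    """Equation (6), the equal-theta approximation."""
    return 1.0 - (1.0 - theta) ** n

# One representative element per subset, since N1 = N2.
thetas = [0.3, 0.1]
theta_bar = sum(thetas) / len(thetas)

for n in range(0, 13):
    print(n, round(p_exact(n, thetas), 4), round(p_equal(n, theta_bar), 4))

# First trial on which the exact curve falls below the approximation.
first_below = next(
    n for n in range(1, 100) if p_exact(n, thetas) < p_equal(n, theta_bar)
)
print("exact curve first falls below the approximation at n =", first_below)
```

Tabulating the values in this way shows the early advantage of the exact curve, the single crossing, and the small size of the discrepancy thereafter.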

Inspection of Fig. 2 shows that the exact treatment leads to higher values of $p(n)$ on the early trials but to lower values on the later trials, the difference becoming negligible for large n. The reason, in brief, for the steeper curvature of the exact curve is that elements with high $\theta$ values are likely to be drawn, and therefore conditioned to A, earlier in the learning process than elements with low $\theta$ values, and then, because they will tend to recur frequently in successive samples, to lead to relatively high values of p. During the late stages of learning, elements with low $\theta$ values that have not been drawn on the early trials will contribute more unconnected elements per trial than would be appearing at the same stage with an equal $\theta$ distribution and will depress the value of p below the curve for the equal $\theta$ approximation.

It  should  be  emphasized  that  the 
generality  of  the  present  approach  to 
learning  theory  lies  in  the  concepts  in- 
troduced and  the  methods  developed 
for  operating  with  them,  not  in  the 
particular  equations  derived.  Equa- 
tion (5),  for  example,  can  be  expected 
to  apply  only  to  an  extremely  narrow 
class  of  learning  experiments.  On  the 
other  hand,  the  methods  utilized  in  de- 
riving equation  (5)  are  applicable  to  a 
wide  variety  of  situations.  For  the  in- 
terest of  the  experimentally  oriented 
reader,  we  will  indicate  briefly  a  few  of 
the    most    obvious    extensions    of    the 


theory developed above, limiting ourselves to the equal $\theta$ case.

As  written,  equation  (6)  represents 
the  predicted  course  of  conditioning  for 
a single organism with an initial response probability of zero. We can allow for the possibility that an experiment may begin at some value of $p(0)$ other than zero by rewriting (6) in the more general form

(7)  $p(n) = 1 - [1 - p(0)](1 - \theta)^n$

which has the same form as (6) except for the initial value.

If  we  wish  to  consider  the  mean 
course  of  conditioning  in  a  group  of  m 
organisms with like values of $\theta$ but varying initial response probabilities, we need simply sum equation (7) over the group and divide by m, obtaining

(8)  $\bar{p}(n) = \frac{1}{m} \sum p(n) = 1 - [1 - \bar{p}(0)](1 - \theta)^n.$

The standard deviation of $p(n)$ under these circumstances is simply

(9)  $\sigma_p(n) = \sqrt{\frac{1}{m} \sum p^2(n) - \bar{p}^2(n)} = (1 - \theta)^n \sigma_p(0)$

where $\sigma_p(0)$ is the dispersion of the initial p values for the group. Variability around the mean learning curve decreases to zero in a simple manner as learning progresses.
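The group relations (8) and (9) can be verified for any particular set of initial values. The sketch below uses a hypothetical group of four organisms and a hypothetical common $\theta$; neither is taken from the text. The standard deviation is computed with divisor m, as in equation (9).

```python
from statistics import mean, pstdev

def p_individual(n, p0, theta):
    """Equation (7): acquisition curve for one organism starting at p0."""
    return 1.0 - (1.0 - p0) * (1.0 - theta) ** n

theta = 0.2                        # hypothetical common theta
p0_values = [0.0, 0.1, 0.25, 0.4]  # hypothetical initial probabilities

for n in (0, 1, 5, 10):
    ps = [p_individual(n, p0, theta) for p0 in p0_values]
    # Closed forms (8) and (9) for the group mean and standard deviation.
    mean_closed = 1.0 - (1.0 - mean(p0_values)) * (1.0 - theta) ** n
    sd_closed = (1.0 - theta) ** n * pstdev(p0_values)
    print(n, mean(ps), mean_closed, pstdev(ps), sd_closed)
```

Each row should show the directly computed mean and dispersion agreeing with the closed forms, with the dispersion shrinking by the factor $(1 - \theta)$ on every trial.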

The  treatment  of  counter-condition- 
ing, i.e.,  extinguishing  one  response  by 
giving  uniform  reinforcement  to  a  com- 
peting response,  follows  automatically 
from  our  account  of  the  acquisition 
process.  Returning  to  equation  (6) 
and recalling that the probabilities of A and $\bar{A}$ must always sum to unity, we note that while response A undergoes conditioning in accordance with (6), response $\bar{A}$ must undergo extinction in accordance with the function

$p_{\bar{A}}(n) = 1 - p_A(n) = (1 - \theta)^n.$

If, then, we begin with any arbitrary $p(0)$ and arrange conditions so that $\bar{A}$ is evoked and conditioned to all elements drawn on each trial, the extinction of response A will be given by the simple decay function

(10)  $p(n) = p(0)(1 - \theta)^n.$

Again the mean and standard deviation of $p(n)$ can easily be computed for a group of organisms with like values of $\theta$ but varying values of $p(0)$:

(11)  $\bar{p}(n) = \bar{p}(0)(1 - \theta)^n$

(12)  $\sigma_p(n) = (1 - \theta)^n \sigma_p(0).$

As  in  the  case  of  acquisition,  variability 
around  the  mean  curve  decreases  to 
zero  in  a  simple  manner  over  a  series  of 
trials. 

Since variability due to variation in $p(0)$ is reduced during both conditioning and counter-conditioning, it will be
seen  that  in  general  we  should  expect 
less  variability  around  a  curve  of  re- 
learning  than  around  a  curve  of  original 
learning  for  a  given  group  of  subjects. 

Application of the Statistical Model to Learning Experiments

Since  our  concern  in  this  paper  has 
been  with  the  development  of  a  stimu- 
lus model  of  considerable  generality,  it 
has  been  necessary  in  the  interests  of 
clear  exposition  to  omit  reference  to 
most  of  the  empirical  material  upon 
which  our  theoretical  assumptions  are 
based.  The  evaluation  of  the  model 
must  rest  upon  detailed  interpretation 
of  specific  experimental  situations.  It 
is  clear,  however,  that  the  statistical 
model  developed  here  cannot  be  tested 
in  isolation;  only  when  it  is  taken  to- 
gether with  assumptions  as  to  how 
learning  occurs  and  with  rules  of  cor- 


respondence between  terms  of  the 
theory  and  experimental  variables,  will 
experimental  evaluation  be  possible. 
Limitations  of  space  preclude  a  detailed 
theoretical  analysis  of  individual  learn- 
ing situations  in  this  paper.  In  order 
to  indicate  how  the  model  will  be  uti- 
lized and  to  suggest  some  of  its  ex- 
planatory potentialities  we  shall  con- 
clude with  a  few  general  remarks 
concerning  the  interpretation  of  learn- 
ing phenomena  within  the  theoretical 
framework  we  have  developed. 

Application of the model to any one isolated experiment will always involve an element of circularity, for information about a given $\theta$ distribution must be obtained from behavioral data. This circularity disappears as soon as data are available from a number of related experiments. The utility of the concept is expected to lie in the possibility of predicting a variety of facts once the parameters of the $\theta$ distribution have been evaluated for a situation. The methodology involved has been illustrated on a small scale by an experiment (6) in which the mean $\theta$ value for an operant conditioning situation was estimated from the acquisition curve of a bar-pressing habit and then utilized in predicting the course of acquisition of a second bar-pressing habit by the same animals under slightly modified conditions.

When  the  statistical  model  is  taken 
together  with  an  assumption  of  associa- 
tion by  contiguity,  we  have  the  essen- 
tials of  a  theory  of  simple  learning. 
The  learning  functions  (5),  (6),  and 
(10)  derived  above  should  be  expected 
to  provide  a  description  of  the  course 
of  learning  in  certain  elementary  ex- 
periments in  the  areas  of  conditioning 
and  verbal  association.  It  must  be  em- 
phasized, however,  that  these  functions 
alone  will  not  constitute  an  adequate 
theory  of  conditioning,  for  a  number  of 
relevant variables, especially those controlling response decrement, have not been taken into account in our derivations. In conditioning experiments where decremental factors are minimized, there is considerable evidence (1, 4, 9, 14, 16) that the curve of conditioning has the principal properties of our equation (5) and can be well approximated by the equal $\theta$ case (7).
The  fact  that  functions  derived  from 
the  model  can  be  fitted  to  certain  em- 
pirical curves  is  a  desirable  outcome,  of 
course,  but  cannot  be  regarded  as  pro- 
viding a  very  exacting  test  of  the 
theory;  probably  any  contemporary 
quantitative  theory  will  manage  to  ac- 
complish this  much.  On  the  other 
hand,  the  fact  that  the  properties  of  our 
learning  functions  follow  from  the  sta- 
tistical nature  of  the  stimulating  situa- 
tion is  of  some  interest;  in  this  respect 
the  structure  of  the  present  theory  is 
simpler  than  certain  others,  e.g.,  that 
of  Hull  (13),  which  require  an  inde- 
pendent postulate  to  account  for  the 
form  of  the  conditioning  curve. 

It  should  also  be  noted  that  devia- 
tions from  the  exponential  curve  form 
may  be  as  significant  as  instances  of 
good  fit.  From  the  present  model  we 
must  predict  a  specific  kind  of  deviation 
when the stimulating situation contains elements of widely varying $\theta$ values. If, for example, curves of conditioning to two stimuli taken separately yield significantly different values of $\theta$, then
the  curve  of  conditioning  to  a  com- 
pound of  the  two  stimuli  should  be  ex- 
pected to  deviate  further  than  either  of 
the  separate  curves  from  a  simple 
growth  function.  The  only  relevant  ex- 
periment we  have  discovered  in  the 
literature  is  one  reported  by  Miller 
(16);  Miller's  results  appear  to  be  in 
line  with  this  analysis,  but  we  would 
hesitate  to  regard  this  aspect  of  the 
theory  as  substantiated  until  additional 
relevant  data  become  available. 

Although  we  shall  not  develop  the 


argument  in  mathematical  detail  in  the 
present  paper,  it  may  be  noted  that  the 
statistical  association  theory  yields  cer- 
tain specific  predictions  concerning  the 
effects  of  past  learning  upon  the  course 
of  learning  in  a  new  situation.  In  gen- 
eral, the  increment  or  decrement  in  p 
during  any  trial  depends  to  a  certain 
extent  upon  the  immediately  preceding 
sequence of trials. Suppose that we have two identical animals each of which has $p(n)$ equal, say, to 0.5 at the end of trial n of an experiment, and suppose that for each animal response A is reinforced on trial n + 1. The histories of the two animals are presumed to differ in that the first animal has arrived at $p(n) = 0.5$ via a sequence of reinforced trials while the second animal has arrived at this value via a sequence of unreinforced trials. On trial n + 1, the second animal will receive the greater increment to p (except in the equal $\theta$ case); the reason is, in brief, that for both animals the stimulus elements most likely to occur on trial n + 1 are those with high $\theta$ values; for the first animal these elements will have occurred frequently during the immediately preceding sequence of trials and thus will tend to be preponderantly connected to A prior to trial n + 1; in the case of the second animal, the high $\theta$ elements will have been connected to $\bar{A}$ during the immediately preceding sequence and thus when A is reinforced on trial n + 1, the second animal will receive the greater increment in weight of connected elements. From this analysis it follows that, other things equal, a curve of reconditioning will approach its asymptote more rapidly than the curve of original conditioning unless extinction has actually been carried to zero. How important the role of the unequal $\theta$ distribution will prove to be in accounting for empirical phenomena of relearning cannot be adequately judged until further research has provided
means  for  estimating  the  orders  of  mag- 
nitude of  the  effects  we  have  mentioned 
here. 
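The two-animal argument above can be illustrated with the two-subset population of the earlier numerical example. The sketch rests on several assumptions that are not in the text: response probability is taken as the $\theta$-weighted proportion of connected elements, the second animal is assumed to begin extinction from $p = 1$, and a real-valued trial number found by bisection is used as an interpolation so that both animals stand at exactly $p = 0.5$. All function names are hypothetical.

```python
def response_prob(F, thetas):
    """p as the theta-weighted proportion of elements connected to A."""
    return sum(t * f for t, f in zip(thetas, F)) / sum(thetas)

def increment_after_reinforcement(F, thetas):
    """Expected gain in p when A is reinforced on one further trial.

    Each unconnected element is drawn, and so connected, with probability theta_i.
    """
    F_next = [f + t * (1.0 - f) for t, f in zip(thetas, F)]
    return response_prob(F_next, thetas) - response_prob(F, thetas)

def trials_to_half(thetas):
    """Real-valued n at which acquisition from p(0) = 0 reaches 0.5 (bisection)."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        F = [1.0 - (1.0 - t) ** mid for t in thetas]
        if response_prob(F, thetas) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def compare(thetas):
    """Increments on the next reinforced trial for the first and second animal."""
    n = trials_to_half(thetas)
    first = [1.0 - (1.0 - t) ** n for t in thetas]  # reached 0.5 by acquisition
    second = [(1.0 - t) ** n for t in thetas]       # reached 0.5 by extinction
    return (increment_after_reinforcement(first, thetas),
            increment_after_reinforcement(second, thetas))

print(compare([0.3, 0.1]))  # unequal theta values
print(compare([0.2, 0.2]))  # equal theta case
```

With unequal $\theta$ values the second figure of the pair should exceed the first, while in the equal $\theta$ case the two increments coincide, in line with the verbal argument.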

Summary 

Earlier  statistical  treatments  of  sim- 
ple associative  learning  have  been  re- 
fined and  generalized  by  analyzing  the 
stimulus  concept  in  greater  detail  than 
heretofore  and  by  taking  account  of  the 
fact  that  different  components  of  a 
stimulating  situation  may  have  differ- 
ent probabilities  of  affecting  behavior. 

The  population  of  stimulus  events 
corresponding  to  an  independent  ex- 
perimental variable  is  represented  in 
the  statistical  model  by  a  mathematical 
set.  The  relative  frequencies  with 
which  various  aspects  of  the  stimulus 
variable  affect  behavior  in  a  given  ex- 
periment are  represented  by  set  opera- 
tions and  functions. 

The  statistical  model,  taken  together 
with  an  assumption  of  association  by 
contiguity,  provides  a  limited  theory 
of  certain  conditioning  phenomena. 
Within  this  theory  it  has  been  possible 
to  distinguish  aspects  of  the  learning 
process  that  depend  upon  properties  of 
the  stimulating  situation  from  those 
that  do  not.  Certain  general  predic- 
tions from  the  theory  concerning  ac- 
quisition, extinction,  and  relearning,  are 
compared  with  experimental  findings. 

Salient  characteristics  of  the  model 
elaborated  here  are  compared  with 
other  quantitative  formulations  of 
learning. 
ANALYSIS OF A VERBAL CONDITIONING SITUATION IN TERMS OF STATISTICAL LEARNING THEORY¹
Indiana University


It  is  the  purpose  of  this  study  to 
investigate  the  theoretical  significance 
of  a  rather  striking  coincidence  be- 
tween an  experimental  fact  and  a 
mathematical  fact.  The  experimental 
fact  has  been  established  in  the 
Humphreys-type  "verbal  condition- 
ing" situation.  In  this  situation  S  is 
asked  to  predict  on  each  of  a  series  of 
trials  whether  some  designated  event, 
e.g.,  the  flash  of  a  light,  will  occur; 
this  event,  the  analogue  of  the  US  in  a 
conditioning  experiment,  is  presented 
in  accordance  with  a  predetermined 
schedule,  usually  random  with  some 
fixed  probability.  Several  recent  in- 
vestigators (3,  5)  have  noted  that  S 
tends  to  match  his  response  rate  to 
the  rate  of  occurrence  of  the  predicted 
event  so  that  if  the  probability  of  the 
latter  is,  say,  .75,  the  mean  response 
curve  for  a  group  of  Ss  tends  over  a 
series  of  trials  toward  an  apparently 
stable  final  level  at  which  the  event  is 
predicted  on  approximately  75%  of 
the  trials.  This  behavior  has  seemed 
puzzling  to  most  investigators  since  it 
does  not  maximize  the  proportion  of 
successful predictions and thus does not conform to conventional law of effect doctrine. The mathematical fact which will concern us appeared in the course of developing the formal consequences of statistical association theory (1, 2); in a simple associative learning situation satisfying certain conditions of symmetry, the theoretical asymptote of response probability turns out to be equal to the probability of reinforcement. The reasoning involved may be sketched briefly as follows.

¹ This research was facilitated by the senior author's tenure as a faculty research fellow of the Social Science Research Council.

This article appeared in J. exp. Psychol., 1954, 47, 225-234. Reprinted with permission.

We consider a situation in which each trial begins with presentation of a signal, or CS; following the signal, one or the other of two reinforcing stimuli, $E_1$ or $E_2$, occurs, the probability of $E_1$ and $E_2$ during a given series being $\pi$ and $1 - \pi$, respectively. The behaviors available to S are categorized into two classes, $A_1$ and $A_2$, by experimental criteria. In the verbal conditioning situation, $A_1$ is a prediction that $E_1$ will occur, and $A_2$ a prediction that $E_2$ will occur on the given trial. We assume that the CS determines a population, $S_c$, of stimulus elements which is sampled by S on each trial, the proportion $\theta$ of the elements in this population constituting the effective sample on any one trial. The dependence of S's responses upon the stimulating situation is expressed in the theory by defining a conditional relationship such that each element in $S_c$ is conditioned to (tends to evoke) either $A_1$ or $A_2$. In order to interpret the formal model in terms of a verbal conditioning experiment, we assume that when an $E_1$ occurs it evokes from S a response belonging to class $A_1$, i.e., one which is compatible with the response of predicting $E_1$ but which interferes with the response of predicting $E_2$, and that when an $E_2$ occurs it evokes a response of class $A_2$. Then on a trial on which $E_1$ occurs we expect on the basis of association principles (1) that all elements sampled from $S_c$ on the trial will become conditioned to $A_1$ while on an $E_2$ trial the sample will be conditioned to $A_2$. Now if successive trials are sufficiently discrete so that samples from $S_c$ are statistically independent, the probability of an $A_1$ after Trial n, abbreviated $p(n)$, is defined in the model as the proportion of elements in $S_c$ that are conditioned to $A_1$, and similarly for the probability of an $A_2$, $[1 - p(n)]$. With these definitions the rule for calculating the change in response probability on an $E_1$ trial may be stated formally as

$p(n+1) = (1 - \theta)p(n) + \theta$    (1)

and on an $E_2$ trial as

$p(n+1) = (1 - \theta)p(n).$    (2)

The genesis of these equations will be fairly obvious. The proportion $(1 - \theta)$ of stimulus elements is not sampled, and the status of elements that are not sampled on a trial does not change; the proportion $\theta$ is sampled and these elements are all conditioned either to $A_1$ or to $A_2$ accordingly as an $E_1$ or an $E_2$ occurs.² Now in a random reinforcement situation, Equation 1 will be applicable on the proportion $\pi$ of trials and Equation 2 on the proportion $(1 - \pi)$; then the average probability of $A_1$ after Trial n + 1 will be given by the relation

$p(n+1) = \pi[(1 - \theta)p(n) + \theta] + (1 - \pi)(1 - \theta)p(n) = (1 - \theta)p(n) + \theta\pi.$    (3)

If a group of Ss begins an experiment with the value $p(0)$, then at the end of Trial 1 we would have

$p(1) = (1 - \theta)p(0) + \theta\pi,$

at the end of Trial 2

$p(2) = (1 - \theta)[(1 - \theta)p(0) + \theta\pi] + \theta\pi = \pi - [\pi - p(0)](1 - \theta)^2,$

and so on for successive trials; in general it can be shown by induction that at the end of the nth trial

$p(n) = \pi - [\pi - p(0)](1 - \theta)^n.$    (4)

Since $(1 - \theta)$ must be a fraction between zero and one, it will be seen that Equation 4 must be a negatively accelerated curve running from the initial value $p(0)$ to the asymptotic value $\pi$.

² Consequently the functions derived in this paper should be expected to apply only to learning situations which are symmetrical in the following sense. To each response class there must correspond a reinforcing condition which, if present on any trial, ensures that a response belonging to the class will terminate the trial. These functions should, for example, be applicable to learning of a simple left-right discrimination with correction; but not to a left-right discrimination without correction, to free responding in the Skinner box, or to Pavlovian conditioning.

This outcome of the statistical learning model is rather surprising at first since it makes asymptotic response probability depend solely upon the probability of reinforcement. It seems, however, to be in excellent agreement with the experimental results of Grant, Hake, and Hornseth (3) and Jarvik (5). The question that interests us now is whether this agreement is to be regarded as a remarkable coincidence or as a confirmation of the theory.
We  cannot  estimate  a  confidence  level 
for  the  latter  conclusion  since  the  ex- 
periments were  not  conducted  specifi- 
cally to  test  the  theory,  and  we  cannot 
guarantee  that  we  would  be  as  alert  to 
notice  results  contrary  to  the  theory 
which  might  appear  in  the  literature 
as we have been in the case of these
decidedly  positive  instances.  It  has 
seemed  to  us  that  the  least  objection- 
able way  out  of  this  impasse  is  to 
carry  out  some  new  experiments,  mak- 
ing use  of  one  of  the  convenient  fea- 
tures of  a  mathematical  theory,  namely, 
that  if  it  will  generate  one  testable 
prediction  for  a  given  experimental 
situation,  it  can  generally  be  made  to 
yield  many  more.  In  the  experiment 
to  be  reported  we  have  tried  to  set  up 
a  situation  similar  in  essentials  to  that 
used  by  Humphreys,  Grant,  and  others 
with  an  experimental  design  which 
would  permit  testing  of  a  variety  of 
consequences  of  the  theory.  Each  S 
was  run  through  two  successive  series 
of 120 trials in an individualized modification of the Humphreys situation with the schedule of $\pi$ values shown in Table 1. Within the first series we will be able to compare learning rates and asymptotes of groups starting from similar initial values but exposed to different probabilities of reinforcement; within the second series we will be able to compare groups starting at different initial values but exposed to the same probabilities of reinforcement. Comparison of Group I with the other groups over both series will permit evaluation of the stability of learning rate ($\theta$ value) from series to series when the $\pi$ value does or does not change. Series IA and Series IIB will provide a comparison in which initial response probabilities and $\pi$ values are the same but the amount of preceding reinforcement differs. In order to separate the effect of over-all $\pi$ value from that of particular orders of event occurrences, each of the three groups indicated in Table 1 has been subdivided into four subgroups of four Ss each; within a treatment group, say Group I, all subgroups have the same $\pi$ value but each receives a separate randomly drawn sequence of $E_1$'s and $E_2$'s.

TABLE 1

Experimental Design in Terms of Probability of Reinforcement ($\pi$ Value) During Each Series

Group   N    Trials 1-120 (Series A)   Trials 121-240 (Series B)
I       16   .30                       .30
II      16   .50                       .30
III     16   .85                       .30
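The predictions that Equation 4 makes for this design can be sketched numerically. In the block below the learning rate `theta` and the starting probability `p_start` are hypothetical values chosen only for illustration (the study estimates $\theta$ from the data), and the function names are likewise assumptions. The loop applies the trial-by-trial recursion of Equation 3 and prints its result beside the closed form of Equation 4 for each group and series.

```python
def step(p, theta, pi):
    """Equation 3: expected response probability after one more trial."""
    return (1.0 - theta) * p + theta * pi

def closed_form(n, p0, theta, pi):
    """Equation 4: expected response probability after n trials."""
    return pi - (pi - p0) * (1.0 - theta) ** n

theta = 0.05    # hypothetical learning rate
p_start = 0.5   # hypothetical initial probability of an A1 response

# Probability of reinforcement in Series A and Series B for each group.
design = {"I": (0.30, 0.30), "II": (0.50, 0.30), "III": (0.85, 0.30)}

for group, (pi_a, pi_b) in design.items():
    p = p_start
    for _ in range(120):   # Series A, trials 1-120
        p = step(p, theta, pi_a)
    end_a = p
    for _ in range(120):   # Series B, trials 121-240
        p = step(p, theta, pi_b)
    print(group,
          round(end_a, 3), round(closed_form(120, p_start, theta, pi_a), 3),
          round(p, 3), round(closed_form(120, end_a, theta, pi_b), 3))
```

Under these assumed values each group's predicted curve ends Series A near its own $\pi$ value and ends Series B near .30, which is the pattern of asymptotes the design is arranged to test.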

Method 

Apparatus. The experiment was run in a
room  containing  a  2-ft.  square  signal  board  and 
four  booths.  Upon  the  signal  board  were 
mounted  12  12-v.,  .25-amp.  light  bulbs  spaced 
evenly  in  a  circle  18  in.  in  diameter.  The  bulbs 
occupied  the  half-hour  positions  of  a  clock  face. 
Only  the  top  two  lights  on  the  board  were  used 
as  signals  in  this  experiment.  The  signal  board 
was  mounted  vertically  on  a  table  40  in.  high 
and  was  about  5  ft.  in  front  of  Ss'  booths. 

The  booths  were  made  from  two  30  X  60  in. 
tables,  30  in.  high,  placed  end  to  end  but  meeting 
at  an  angle  so  that  Ss  sitting  behind  them  would 
be  facing  almost  directly  toward  the  signal  board, 
about  7  ft.  in  front  of  Ss'  eyes.  Two  Ss  sat  at 
each  table.  The  four  Ss  were  separated  from 
one  another  by  panels  2  ft.  high  and  32  in.  wide. 
These  panels  were  mounted  vertically  on  the 
table  tops  so  as  to  extend  14^  in.  beyond  the 
edge  of  the  table  between  the  seated  Ss. 

In  each  booth,  18  in.  back  from  S's  edge  of 
the  table,  was  a  wooden  panel  12  in.  high 
mounted  vertically  on  the  table  top  and  extend- 
ing across  the  width  of  the  booth.  On  the  side 
of  this  panel  facing  S  were  two  reinforcing  lights 
of  the  same  size  as  those  on  the  signal  board  but 
covered  by  white,  translucent  lenses.  These 
lights  were  directly  in  front  of  S,  4  in.  apart  and 
8  in.  above  the  table  top.  On  the  table  below 
each  reinforcing  light  was  a  telegraph  key. 

The  orders  of  presentation  and  the  durations 
of  the  signal  lights  and  reinforcing  lights  were 
controlled by a modified Esterline-Angus recorder using a punched tape and a system of
electrical  pick-up  brushes.  The  recorder  was 
placed  on  the  table  behind  the  signal  board. 
Recorder  pens  which  were  activated  by  depres- 
sion of  the  telegraph  keys  in  Ss'  booths  were 
mounted  between  the  brushes.  Thus,  the  pre- 
sentations of  the  lights  and  Ss'  responses  were 
recorded  on  the  same  tape.  A  panel  light  was 
mounted  above  the  Esterline-Angus  recorder  so 
that  E,  seated  behind  the  signal  board,  could 
watch  the  operation  of  brushes  and  pens  during 
the  experiment. 

Windows  in  the  experimental  room  were  cov- 
ered with  opaque  material  and  the  experiment 
was  run  in  darkness  except  for  light  that  came 
from  the  apparatus. 

Subjects. — The  Ss  were  48  students  obtained 
from  beginning  lecture  courses  in  psychology 
during  the  fall  semester  of  1952  and  assigned  at 
random  to  experimental  groups. 

Procedure. — At  the  beginning  of  a  session,  Ss 
were  brought  into  the  room,  asked  to  be  seated, 
and  read  the  following  instructions: 

"Be  sure  you  are  seated  comfortably;  it  will 
be  necessary  to  keep  one  hand  resting  lightly 
beside  each  of  the  telegraph  keys  throughout  the 
experiment  and  to  watch  both  the  large  board  in 
the  front  of  the  room  and  the  two  small  lights  in 
your  own  compartment.  Your  task  in  this  ex- 
periment will  be  to  outguess  the  experimenter  on 
each  trial,  or  at  least  as  often  as  you  can.  The 
ready  signal  on  each  trial  will  be  a  flash  from  the 
two  top  lights  on  the  big  board.  About  a  second 
later  either  the  left  or  the  right  lamp  in  your 
compartment  will  light  for  a  moment.  As  soon 
as  the  ready  signal  flashes  you  are  to  guess 
whether  the  left  or  the  right  lamp  will  light  on 
that  trial  and  indicate  your  choice  by  pressing 
the  proper  key.  If  you  expect  the  left  lamp  to 
light,  press  the  left  key;  if  you  expect  the  right 
lamp  to  light,  press  the  right  key;  if  you  are  not 
sure,  guess.  Be  sure  to  make  your  choice  as  soon 
as  the  ready  signal  appears,  press  the  proper  key 
down  firmly,  then  release  the  key  before  the 
ready  signal  goes  off.  It  is  important  that  you 
press  either  the  left  or  the  right  key,  never  both, 
on  each  trial,  and  that  you  make  your  decision 
and  indicate  your  choice  while  the  signal  light 
is  on. 

"Now  we  will  give  you  four  practice  trials." 

At  this  point  the  overhead  lights  were  extin- 
guished and  the  recorder  started.  If  any  obvi- 
ous mistakes  were  made  by  S  during  the  four 
practice  trials,  they  were  pointed  out  by  E. 
During the four practice trials the reinforcing lights were always given in the order: E1, E1, E2, E2. After the practice trials the following instructions were read:

"Are you sure you understand all of the instructions so far? The rest of the trials will
have  to  be  run  off  without  any  conversation  or 
other  interruptions.  Please  make  a  choice  on 
every  trial  even  if  it  seems  difficult.  Make  a 
guess  on  the  first  trial,  then  try  to  improve  your 
guesses  as  you  go  along  and  make  as  many  cor- 
rect choices  as  possible." 

Questions  were  answered  by  rereading  or 
paraphrasing  the  appropriate  part  of  the  instruc- 
tions. If  there  were  any  questions  about  tricks 
the  following  additional  paragraph  was  read. 

"We  have  told  you  everything  that  will 
happen.  There  are  no  tricks  or  catches  in  this 
experiment.  We  simply  want  to  see  how  well 
you  can  profit  from  experience  in  a  rather  diffi- 
cult problem-solving  situation  while  working 
under  time  pressure." 

The  recorder  was  now  started  again  and  the 
240  experimental  trials  were  run  off  in  a  con- 
tinuous sequence  with  no  break  or  other  indica- 
tion to  S  at  the  transition  from  Series  A  to 
Series B. On each trial, the signal lamps were lighted for approximately 2 sec.; 1 sec. later the appropriate reinforcing light in each S's booth lighted for .8 sec.; then after an interval of .4 sec. the next ready signal appeared; and so on.
The  high  rate  of  stimulus  presentation  was  used 
in  order  to  minimize  verbalization  on  the  part  of 
Ss. 

Results  and  Discussion 

Terminal  response  probabilities. — It 
will  be  clear  from  our  discussion  of 
Equation 4 that the predicted asymptote for each series will be the value of π obtaining during the series. We have taken the mean proportion of A1
responses  during  the  last  40  trials  of 
each  series  as  an  estimate  of  terminal 
response  probability,  and  these  values 
are  summarized  for  all  groups  and 
both  series  in  Table  2. 

TABLE 2

Terminal Mean Response Probabilities for Each Series

              Series A                  Series B
Group      P      π      t          P      π      t
I         .37    .30    2.35       .28    .30    0.77
II        .48    .50    0.55       .37    .30    2.56
III       .87    .85    0.55       .30    .30    0.05
F            69.31                     2.98

For the first series a simple analysis of variance yields an F significant beyond the .001 level for differences among means. From the within-groups variance estimate we obtain a value for the standard error of a group mean, and this is used in the t test between each group mean and the appropriate theoretical mean. For the second series the between-groups F has a probability between the .05 and .10 levels. In neither series were differences among subgroup means significant at the .05 level.

The  interpretation  seems  straight- 
forward. Group  III  approximates 
the  theoretical  asymptote  in  both 
series.  Group  I  falls  significantly 
short  of  the  theoretical  asymptote  in 
the  first  series  but  approximates  it  in 
the  second  series.  Group  II  falls  sig- 
nificantly short  of  the  theoretical 
asymptote  in  the  second  series,  but 
reaches  the  same  probability  level  as 
had  Group  I  in  the  first  series.  Of  the 
t tests computed for differences between the last two blocks of 20 trials in each series, all yielded probabilities greater than .10 except the t for Series IIB, which was significant at the .02 level. Evidently the predictions concerning mean asymptotic values are
correct,  but  the  rate  of  approach  to 
asymptote  is  faster  with  Group  III 
than  under  the  other  conditions. 

According to theory, not only group means, but also individual curves should approach π asymptotically. To obtain evidence as to the tenability of this aspect of the theory we have examined the distributions of individual A1 response proportions for the last 40 trials of Series IIIA, IIIB, and IB. If all individual p values approximate the theoretical asymptotes over these trials, then for each of the series the individual response proportions should cluster around the mean value, π, with an approximately binomial distribution. Taking the theoretical σ equal to √(40π(1 − π)), which is actually a slight underestimate of the true value, we find that approximately half of the scores in each series fall within one σ of the theoretical asymptote and only one score in each series deviates by more than three σ. It appears, then, that except for a few widely deviant cases the p values of individual Ss approach the theoretical asymptote.

Fig. 1. Empirical and theoretical curves representing mean proportion of E1 predictions (A1 responses) per 20-trial block for each series. Abscissa: blocks of 20 trials (m).
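As a rough numerical illustration of the dispersion criterion, the sketch below evaluates σ = √(40π(1 − π)) for the two asymptote values used in the experiment (π = .30 and π = .85). It treats the 40 terminal responses of an S as independent trials with constant probability π, which is an assumption of the illustration; the function name is ours and does not come from the paper.

```python
import math


def binomial_sd(n_trials: int, pi: float) -> float:
    """Standard deviation of the number of A1 responses in n_trials
    independent trials, each with probability pi (binomial assumption)."""
    return math.sqrt(n_trials * pi * (1.0 - pi))


if __name__ == "__main__":
    for pi in (0.30, 0.85):
        sd = binomial_sd(40, pi)
        # Report the band in response counts and as a proportion of 40 trials.
        print(f"pi = {pi:.2f}: sigma = {sd:.3f} responses, {sd / 40:.4f} as a proportion")
```

Dividing σ by 40 converts the band from a count of A1 responses into the proportion scale on which the individual scores are reported.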

One  might  raise  a  question  as  to  just 
what  is  meant  by  the  asymptote  of  an 
empirical  curve  in  a  situation  of  this 
kind.  Naturally  one  would  not  expect 
the  Ss  to  perform  at  constant  rates 
indefinitely.  It  does  not  seem  that 
any  sort  of  breaking  point  was  ap- 
proached in  the  present  study,  how- 
ever; one  subgroup  of  Group  I  was 
run  for  an  additional  60  trials  beyond 
Trial 240 and maintained an average proportion of .304 A1 responses over these trials.

Mean  learning  curves. — In  Fig.  1 
mean data are plotted in terms of the proportion of A1 responses per block of 20 trials. The theoretical function which should describe these empirical curves is readily obtained from Equation 4. Letting m be the ordinal number of a block of 20 trials running from Trial n + 1 to Trial n + 20 inclusive, and P(m) the expected proportion of A1 responses in the block, we can write

$$P(m) = \pi - \frac{[\pi - p(0)]\,(1 - \theta)^{20(m-1)}}{20\,\theta}\left[1 - (1 - \theta)^{20}\right] \tag{5}$$

this expression being simply the mean value of p(n) over the mth block of 20 trials. According to theory, Equation 5 should describe each of the mean curves of Fig. 1 once numerical values are substituted for the parameters π, p(0), and θ; furthermore, the value of θ required should not differ among groups within either series and should be constant from series to series for each group. The values of π are of course fixed by the experimental procedure. The values of p(0) in the first series should be in the neighborhood of .50, but for groups of size 16 sampling deviations could be quite large so it will be best to get rid of p(0) in favor of P(1) which can be measured more accurately. To do this we write Equation 5 for m = 1

$$P(1) = \pi - \frac{[\pi - p(0)]}{20\,\theta}\left[1 - (1 - \theta)^{20}\right],$$

then solve for [π − p(0)]

$$[\pi - p(0)] = \frac{20\,\theta\,[\pi - P(1)]}{1 - (1 - \theta)^{20}},$$

and substitute this result into Equation 5 giving

$$P(m) = \pi - [\pi - P(1)]\,(1 - \theta)^{20(m-1)}. \tag{6}$$
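The step from Equation 5 to Equation 6 can be checked numerically. The sketch below assumes that Equation 4, which lies outside this passage, has the form p(n) = π − [π − p(0)](1 − θ)^n, and that block m averages p(n) over n = 20(m − 1), ..., 20m − 1; both are assumptions made so that the direct average reproduces Equation 5. Function names and parameter values are illustrative only.

```python
def p_after(n, pi, p0, theta):
    """Assumed form of Equation 4: response probability after n trials."""
    return pi - (pi - p0) * (1.0 - theta) ** n


def block_mean_direct(m, pi, p0, theta, size=20):
    """Average p(n) over the trials of block m, n = size*(m-1) .. size*m - 1."""
    start = size * (m - 1)
    return sum(p_after(n, pi, p0, theta) for n in range(start, start + size)) / size


def block_mean_eq5(m, pi, p0, theta, size=20):
    """Closed form of Equation 5, written in terms of p(0)."""
    decay = (1.0 - theta) ** (size * (m - 1))
    return pi - (pi - p0) * decay * (1.0 - (1.0 - theta) ** size) / (size * theta)


def block_mean_eq6(m, pi, p_first, theta, size=20):
    """Equation 6: the same curve written in terms of P(1) instead of p(0)."""
    return pi - (pi - p_first) * (1.0 - theta) ** (size * (m - 1))


if __name__ == "__main__":
    pi, p0, theta = 0.30, 0.50, 0.018  # illustrative values only
    p_first = block_mean_eq5(1, pi, p0, theta)
    for m in range(1, 7):
        print(m,
              round(block_mean_direct(m, pi, p0, theta), 4),
              round(block_mean_eq5(m, pi, p0, theta), 4),
              round(block_mean_eq6(m, pi, p_first, theta), 4))
```

The three columns printed for each block should coincide, which is the content of the substitution: once P(1) is known, p(0) no longer has to be estimated separately.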

Observed values of P(1) turn out to be .58 and .59 for Series IA and IIIA, respectively. Now we lack only empirical estimates of θ and these can be obtained by a simple statistical procedure. The method we have used is to sum Equation 6 over all values of m, obtaining for K blocks of trials

$$\sum_{m=1}^{K}\left\{\pi - [\pi - P(1)]\,(1 - \theta)^{20(m-1)}\right\} = K\pi - [\pi - P(1)]\,\frac{1 - (1 - \theta)^{20K}}{1 - (1 - \theta)^{20}}, \tag{7}$$


then equate Equation 7 to the sum of the observed proportions for a given series and solve for θ. For Group I we obtain the estimate θ = .018 and for IIIA, θ = .08. Using these parameter values we have computed the theoretical curves for Group I and for the first series of Group III, which may be seen in Fig. 1. In this analysis we find agreement between data and theory in one respect but not in another. The theoretical curves provide reasonably good descriptions of the observed points, especially in the case of Group I, but the θ values for the two groups are by no means equal. The latter finding does not come as a surprise inasmuch as we had found in the previous section that Group I was significantly short of its theoretical asymptote in the first series, while Group III was not.
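The estimation step, equating Equation 7 to the observed sum of block proportions and solving for θ, has no closed-form solution, so some numerical root-finder is needed. The sketch below uses bisection, which is our choice and not necessarily the authors' procedure. Because the observed block proportions are not listed in the text, the demonstration generates a target sum from a known θ and then recovers it; all names and values are illustrative.

```python
def eq7_sum(theta, pi, p_first, k_blocks, size=20):
    """Right-hand side of Equation 7: sum of P(m) over m = 1..K."""
    x = (1.0 - theta) ** size
    # Finite geometric series, equal to (1 - x**K) / (1 - x) but safe as x -> 1.
    geometric = sum(x ** j for j in range(k_blocks))
    return k_blocks * pi - (pi - p_first) * geometric


def estimate_theta(observed_sum, pi, p_first, k_blocks, size=20, tol=1e-10):
    """Solve eq7_sum(theta) = observed_sum for theta in (0, 1) by bisection."""
    lo, hi = 1e-9, 1.0 - 1e-9
    f_lo = eq7_sum(lo, pi, p_first, k_blocks, size) - observed_sum
    f_hi = eq7_sum(hi, pi, p_first, k_blocks, size) - observed_sum
    if f_lo * f_hi > 0:
        raise ValueError("observed sum is outside the range Equation 7 can produce")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = eq7_sum(mid, pi, p_first, k_blocks, size) - observed_sum
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Synthetic check: build a sum from a known theta, then recover theta.
    target = eq7_sum(0.018, pi=0.30, p_first=0.58, k_blocks=6)
    print(round(estimate_theta(target, pi=0.30, p_first=0.58, k_blocks=6), 5))
```

Bisection is adequate here because the geometric factor in Equation 7 decreases steadily as θ grows, so the sum is monotone in θ and has at most one root.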

We did not try to estimate a θ value for the first series of Group II since the empirical curve is virtually horizontal and closely approximates the line P(m) = π = .50. We could proceed to estimate θ values for Series IIB and IIIB by the method used above, but it will be of more interest to construct predicted curves for these series without using any additional information from the data. According to the theory, it should be possible to compute those curves from information already at our disposal. The p(0) values in the second series should be the theoretical asymptotes of the first series, or .50 and .85 for Groups II and III, respectively. The only procedural difference between IA and IIB lies in the number of preceding reinforcements; according to the statistical model, however, this variable will be expected to have no effect except insofar as it leads to a change in p(0), so except for sampling error the θ value estimated for Group I should be applicable to IIB. Using .50, .30, and .018 as the values of p(0), π, and θ, respectively, we have computed a theoretical curve for Series IIB, and this is plotted in Fig. 1. Similarly, the θ value estimated for Series IIIA should apply also to IIIB, and we have used this value, .08, together with .30 for π and .85 for p(0) to compute the predicted curve for IIIB shown in Fig. 1. Considering that no degrees of freedom in the Series B data have been utilized in curve fitting, the correspondence between the theoretical and empirical curves does not seem bad. The reason for some of the irregularities will be brought out in the next section. A statistical test of one aspect of the correspondence can be obtained by calculating for each theoretical curve a predicted mean total of A1 responses in the second series, by means of Equation 5, and comparing these values with the observed mean totals. This has been done and the comparison is given in Table 3. The t values for differences between observed and theoretical values seem satisfactorily low.

Fig. 2. Empirical and theoretical cumulative response curves for individual Ss of Group I

TABLE 3

Predicted and Observed Mean Frequencies of the A1 Response in the Second Series

Group    Observed    Predicted    t
I        37.19       37.74        0.22
II       46.00       45.94        0.02
III      42.75       42.86        0.04
F        3.16 (p > .05)
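The predicted totals for Groups II and III in Table 3 can be approximated from the parameter values quoted in the text. The sketch below assumes the same form of Equation 4 as before, p(n) = π − [π − p(0)](1 − θ)^n, and sums it over the 120 trials of the second series. Group I is omitted because the text does not state its starting value for the second series. The function names are ours, and the results should land close to, not exactly on, the tabled values, since the published figures were presumably computed from unrounded estimates.

```python
def predicted_total(pi, p0, theta, n_trials=120):
    """Expected number of A1 responses: closed-form sum of p(n), n = 0..n_trials-1."""
    return n_trials * pi - (pi - p0) * (1.0 - (1.0 - theta) ** n_trials) / theta


def direct_total(pi, p0, theta, n_trials=120):
    """Same quantity by explicit summation, as a check on the closed form."""
    return sum(pi - (pi - p0) * (1.0 - theta) ** n for n in range(n_trials))


if __name__ == "__main__":
    # (p(0), pi, theta) as given in the text for Series IIB and IIIB.
    for label, p0, pi, theta in (("II", 0.50, 0.30, 0.018), ("III", 0.85, 0.30, 0.08)):
        print(label, round(predicted_total(pi, p0, theta), 2))
```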

In  order  to  give  an  idea  of  the  extent 
to  which  the  behavior  of  individual  Ss 
conforms  to  the  theoretical  function, 
we  have  plotted  in  Fig.  2  the  indi- 
vidual cumulative  response  curves  for 
all  Ss  of  Group  I.  The  cumulative 
form  was  chosen  for  the  smoothing 
effect,  some  of  the  noncumulative 
curves being too irregular for curve-fitting purposes. The theoretical curves in Fig. 2 represent Equation 7 with θ values obtained by a method of approximation. Ten of the curves are fitted quite well by this function with π = .30 as the asymptote parameter. Four curves, Numbers 2, 11, 15, 16, require other values for this parameter, viz., .075, .45, .24, and .18, respectively. Curves 3 and 4 deviate considerably from the theoretical form.
In  general,  it  appears  that  the  empir- 
ical curves  for  most  individual  Ss  can 
be  described  quite  satisfactorily  by 
the  theoretical  function,  and  this  fact 
gives  us  some  basis  for  inferring  that 
in  this  situation  mean  learning  curves 
for  groups  of  Ss  reflect  the  trend  of 
individual  learning  uncomplicated  by 
any  gross  artifacts  of  averaging. 

The effect of 120 reinforcements at a π value of .50 may be evaluated by comparing curve forms and mean A1 response totals for Series IA and IIB. We find that the reinforcements lead to no increase in resistance to change. Slopes of the two curves are very similar and the response totals do not differ significantly. This result is in line with predictions from the statistical model, but a little surprising, perhaps, from the viewpoint of Thorndikian or Hullian reinforcement theory since partial reinforcement has generally (6) been held to increase resistance to extinction in this situation.

The conclusions from our study of the mean learning curves would seem to be (a) that under some circumstances at least it is possible to evaluate theoretical parameters from the data of one series of learning trials and then to predict the course of learning in a new series; and (b) that the rate at which the mean learning curve approaches its asymptote depends, in an as yet incompletely specified manner, upon the difference between initial response probability and the probability of reinforcement obtaining during the series.

Sequence  effects. — The  mean  curves 
studied  in  the  preceding  section  may 
not  reflect  adequately  all  of  the  learn- 
ing that  went  on  during  the  experi- 
ment. The  irregularities  in  some  of 
the  mean  curves  of  Fig.  1  might  be 
accounted for if there is a significant tendency for Ss' response sequences to follow the vagaries of the sequences of E1's and E2's. To check on this possibility we have plotted in Fig. 3 the mean proportions of A1 responses vs. frequencies of E1 occurrences per 10-trial block for all groups in Series B. In preparing this graph, the 120 trials of Series B were divided into 12 successive blocks of 10. Since there were 48 Ss, there were 576 of these trial blocks and they were classified according to the number of E1 occurrences in a block. Then for the set of all blocks in which no E1's occurred, the mean proportion of A1 responses was computed and entered as the first point in Fig. 3, and so on for the remaining points. It seems clear that Ss were responding to the particular sequences of E1's and E2's, and not simply to the over-all rate. Corresponding graphs for the three groups in the first series had somewhat shallower slopes; they have not been reported since some of the individual points were based on too few cases to be reliable and the groups could not be averaged together in the first series owing to the different π values.
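The block classification behind Fig. 3 can be sketched as follows. The data layout is hypothetical: each S is represented by two equal-length sequences of 0/1 codes, with 1 standing for E1 in the event sequence and for A1 in the response sequence. The function pools 10-trial blocks over Ss, groups them by the number of E1 occurrences, and returns the mean A1 proportion for each group.

```python
from collections import defaultdict


def pooled_block_profile(subjects, block=10):
    """Mean proportion of A1 responses per block, keyed by the E1 count in the block.

    subjects: iterable of (events, responses) pairs, each a sequence of 0/1 codes.
    """
    proportions = defaultdict(list)
    for events, responses in subjects:
        if len(events) != len(responses):
            raise ValueError("events and responses must have the same length")
        # Step through successive, non-overlapping blocks of trials.
        for start in range(0, len(events) - block + 1, block):
            e1_count = sum(events[start:start + block])
            a1_prop = sum(responses[start:start + block]) / block
            proportions[e1_count].append(a1_prop)
    return {count: sum(vals) / len(vals) for count, vals in sorted(proportions.items())}


if __name__ == "__main__":
    # Two made-up blocks for one S: ten E1's, then ten E2's.
    events = [1] * 10 + [0] * 10
    responses = [1] * 5 + [0] * 5 + [0] * 10
    print(pooled_block_profile([(events, responses)]))
```

With 48 Ss and 120 trials each, the same call would pool the 576 blocks described in the text.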

In order to deal statistically with this apparent dependence of response tendency upon the density of E1 occurrences in the immediately preceding sequence, we have computed for each series the average probability, P(A1|E1), that an A1 occurs on Trial n given that an E1 occurs on Trial n − 1 and the average probability, P(A1|E2), that an A1 occurs on Trial n given that an E2 occurs on Trial n − 1. The difference between these two quantities can be shown to be proportional to the point correlation (7) between A(n) and E(n − 1) for a given series. Furthermore our Equations 1 and 2 may be regarded as theoretical expressions for the two conditional probabilities, P(A1|E1) and P(A1|E2), respectively, and it will be seen that if these expressions are averaged over all values of n in a series and the second subtracted from the first, the difference is equal to the parameter θ, i.e.,

$$(1 - \theta)\,p(n) + \theta - (1 - \theta)\,p(n) = \theta.$$

Thus from the statistical model we must predict that the difference between empirical estimates of these conditional probabilities for any series should be positive and, if successive trials are independent, this difference should be equal to the value of θ estimated from the mean response curve. The conditional probabilities have been computed from the data for each S and mean differences by groups are summarized in Table 4.

TABLE 4

Mean Differences between Observed Values of P(A1|E1) and P(A1|E2) for Each Series

Series    Group I    Group II    Group III
A         .128       .199        .153
B         .214       .294        .231

Fig. 3. Mean proportion of E1 predictions (A1 responses) in a block of ten trials plotted against the actual number of E1 occurrences in the block; data averaged for all groups in Series B
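The empirical quantity summarized in Table 4 can be computed from a single S's record as sketched below. The 0/1 coding (1 for E1 or A1, 0 for E2 or A2) and the function names are assumptions of the illustration. The second function restates the model identity from the text, that the two one-trial expressions differ by θ whatever the current value of p(n).

```python
def conditional_difference(events, responses):
    """Estimate P(A1 on trial n | E1 on n-1) minus P(A1 on trial n | E2 on n-1)."""
    after_e1 = [responses[n] for n in range(1, len(responses)) if events[n - 1] == 1]
    after_e2 = [responses[n] for n in range(1, len(responses)) if events[n - 1] == 0]
    if not after_e1 or not after_e2:
        raise ValueError("both E1 and E2 must precede at least one trial")
    return sum(after_e1) / len(after_e1) - sum(after_e2) / len(after_e2)


def model_difference(p, theta):
    """Model counterpart: [(1 - theta) p + theta] - [(1 - theta) p]."""
    after_e1 = (1.0 - theta) * p + theta   # expression for a trial following E1
    after_e2 = (1.0 - theta) * p           # expression for a trial following E2
    return after_e1 - after_e2


if __name__ == "__main__":
    # Made-up record in which S always predicts the event that just occurred.
    events = [1, 0, 1, 0, 1, 0]
    responses = [0, 1, 0, 1, 0, 1]
    print(conditional_difference(events, responses))
    print(model_difference(p=0.4, theta=0.08))
```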

All of the differences are positive and significant at better than the .001 level of confidence. The differences among group means are insignificant for both series (F's equal to .45 and .73, respectively) as are differences among subgroup means. The increases from the first series to the second are, however, significant beyond the .005 level. The latter effect was not anticipated on theoretical grounds; the
most  plausible  explanation  that  has 
occurred  to  us  is  that  alternation 
tendencies  associated  with  previously 
established  guessing  habits  extin- 
guished during  the  early  part  of  the 
experiment.  This  hypothesis  would 
also  account  for  the  high  P(l)  value 
observed  for  Group  I  in  Fig.  1. 

Although all of the quantities in Table 4 are positive and apparently independent of π, as required by the theory, the numerical values are all larger than the θ estimates obtained from mean response curves. The most straightforward interpretation of this disparity would be that, owing to the short intertrial interval, successive trials are not independent in the sense required by the theoretical model. Nonindependence would have at least two immediate consequences in so far as the present experiment is concerned. First, stimulus samples drawn on successive trials would overlap, and the learning that occurred on one trial would affect behavior on the next to a greater extent than random sampling would allow for, thus increasing P(A1|E1) and decreasing P(A1|E2). Second, the reinforcing stimulus of one trial, E1 or E2, would be part of the stimulus complex effective at the beginning of the next trial. If this interpretation is correct, then more widely spaced trials should result in better agreement between the alternative estimates of θ and also in reduction of the dependence of mean learning rate upon probability of reinforcement.

Summary 

Learning  rates,  asymptotic  behavior,  and 
sequential  properties  of  response  in  a  verbal  con- 
ditioning situation  were  studied  in  relation  to 
predictions  from  statistical  learning  theory. 

Forty-eight  college  students  were  run  in  an 
individualized  modification  of  the  "verbal  condi- 
tioning" experiment  originated  by  Humphreys 
(4).  Each  trial  consisted  in  presentation  of  a 
signal  followed  by  a  left-hand  or  right-hand 
"reinforcing"  light;  S  operated  an  appropriate 
key  to  indicate  his  prediction  as  to  which  light 
would  appear  on  each  trial.  For  each  S  one  of 
the lights, selected randomly, was designated as E1, the other as E2. On the first series of 120 trials, E1 occurred with probability .30, .50, and .85 for Groups I, II, and III, respectively. On the second 120 trials, E1 occurred with probability .30 for all groups.

Theoretical predictions were that mean probability of predicting E1 should tend asymptotically to the actual probability of E1, both during
original  learning  and  following  a  shift  in  proba- 


bility of  reinforcement;  and  that  response 
probabilities  should  change  in  accordance  with 
exponential  functions,  learning  rates  (as  meas- 
ured by  slope  parameters)  being  independent  of 
both  initial  condition  and  probability  of  rein- 
forcement. 

The  statistical  criterion  for  approach  to  theo- 
retical asymptote  was  met  by  Group  I  by  the 
end  of  the  second  series  and  by  Group  III  in 
both  first  and  second  series.  In  the  second 
series,  Group  II  was  short  of  theoretical  asymp- 
tote but  reached  the  same  response  probability 
as  had  Group  I  during  the  first  series. 

Learning  rates  were  virtually  identical  for 
Group  I,  first  series,  and  Group  II,  second  series, 
indicating  that  resistance  of  response  probability 
to  change  is  not  altered  by  50%  random  rein- 
forcement in  this  situation.  Learning  rates  dif- 
fered significantly  among  groups  within  both 
series.  In  general,  learning  rate  was  directly 
related  to  difference  between  initial  response 
probability  and  probability  of  reinforcement 
during  a  series.  It  was  suggested  that  this  rela- 
tionship may  depend  upon  temporal  massing  of 
trials.  Not  only  group  means,  but  individual 
learning  curves  could  be  described  satisfactorily 
by  theoretical  functions. 

No tendency was observed for Ss to respond to a series as a whole. On the contrary, sensitivity to effects of individual reinforcements and nonreinforcements (E1 and E2 occurrences) increased significantly as a function of trials.


AN INVESTIGATION OF SOME MATHEMATICAL MODELS FOR LEARNING

CURT  F.  FEY  * 

University  of  Pennsylvania 


In  this  study  an  attempt  is  made 
to  determine  whether  the  results  of 
two  different  learning  experiments 
can  be  described  by  stochastic  models 
proposed  by  Bush  and  Mosteller 
(1955)  and  Luce  (1959)  without 
changing  the  model  parameters. 

The  merit  of  a  model  lies  in  its 
ability  to  describe  and  predict  data 
successfully  with  the  aid  of  a  mini- 
mum of  free  parameters.  For  any 
one  experiment  this  can  be  done  with 
several  models.  Consequently  a  more 
stringent  test  of  a  model  is  its  ability 
to  predict  the  fine  structure  of  the 
data  with  one  invariant  set  of  param- 
eters in  such  a  way  that  once  the 
values  of  the  parameters  are  deter- 
mined in  one  experiment  these  same 
parameters  can  be  used  to  predict 
the  outcome  of  another  experiment. 

Galanter  and  Bush  (1959)  previ- 
ously studied  parameter  invariance 
in  the  linear  model  of  Bush  and 
Mosteller  (1955).  Their  analysis 
showed  an  apparent  lack  of  parameter 
invariance  in  a  T-maze  situation,  but 
it  is  not  clear  whether  the  lack  of 
parameter  invariance  was  attributable 
to  a  basic  mechanism  in  the  model 
or  was  a  consequence  of  sampling 
errors  and  difficulties  in  estimating 
parameters. 

The purpose of the present study is to investigate this question of parameter invariance in greater detail. The experimental design was improved over that of Galanter and Bush (1959) by running only one rather than three trials per day, and it was extended to provide a comparison between 100% reinforcement and 75% random reinforcement.

The  Models 

The  two  models  used  in  this  paper  may 
be  designated  as  the  alpha  model  (Bush 
&  Mosteller,  1955)  and  the  beta  model 
(Luce,  1959).  Each  of  them  uses  linear 
transformations.  In  the  alpha  model 
the  linear  transformation  is  applied  to  the 
response  probability  p;  in  the  beta 
model it is applied to the quantity p/(1 − p).

Both  of  these  models  are  stochastic, 
i.e.,  they  deal  with  probabilities  of  mak- 
ing responses.  The  models  are  path- 
independent:  the  response  probability 
on  a  given  trial  depends  only  on  the 
response  probability  and  the  outcome 
on  the  previous  trial. 

An  animal  in  a  T  maze  can  turn  either 
to  the  left  or  to  the  right  on  any  given 
trial. The models state that if on one trial S makes a response for which it gets rewarded, then the probability of
making  that  same  response  on  the  next 
trial  increases.  The  models  specify  the 
manner  of  these  changes. 

Let p_n be the probability of going to the right-hand side (probability of an "error") on trial n; let q_n = 1 − p_n; let α1, α2, β1, and β2 be nonnegative parameters such that α1 and β1 are associated with reward and α2 and β2 are associated with nonreward. The models can then be defined in the following way:

Alpha Model           Response      Outcome      Beta Model
p_{n+1} = α1 p_n      left turn     reward       p_{n+1} = β1 p_n / (β1 p_n + 1 − p_n)
p_{n+1} = α2 p_n      right turn    nonreward    p_{n+1} = β2 p_n / (β2 p_n + 1 − p_n)
q_{n+1} = α1 q_n      right turn    reward       p_{n+1} = p_n / (p_n + β1(1 − p_n))
q_{n+1} = α2 q_n      left turn     nonreward    p_{n+1} = p_n / (p_n + β2(1 − p_n))

Mathematical properties of the alpha model were listed by Galanter and Bush (1959, pp. 272-273) for the special condition that a left response is always rewarded and a right-hand response is never rewarded (100:0). For the beta model the mathematical properties have been determined by Kanal (1960, 1961), Bush, Galanter, and Luce (1959, p. 387), and Bush (1960).

This article appeared in J. exp. Psychol., 1961, 61, 455-461. Reprinted with permission.
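The two update rules can be written out as a short sketch. Several points here are assumptions and are not confirmed by the source: the pairing of each beta-model expression with a response and outcome is reconstructed so that, with β below 1, the error probability falls after a rewarded left turn or an unrewarded right turn; the string codes "L" and "R" and all function names are ours. The demonstration uses the parameter values reported for the 100:0 group and computes the expected trial of the first left turn, which for the alpha model can be compared with the "Trial of first success" entry in Table 1.

```python
def alpha_update(p, response, rewarded, a1, a2):
    """Alpha model: linear operator on p (error probability) or on q = 1 - p."""
    a = a1 if rewarded else a2
    if (response == "L") == rewarded:      # left/reward or right/nonreward
        return a * p                       # p_{n+1} = a * p_n
    return 1.0 - a * (1.0 - p)             # q_{n+1} = a * q_n


def beta_update(p, response, rewarded, b1, b2):
    """Beta model: the odds p / (1 - p) are multiplied or divided by beta."""
    b = b1 if rewarded else b2
    if (response == "L") == rewarded:      # odds of an error multiplied by b
        return b * p / (b * p + 1.0 - p)
    return p / (p + b * (1.0 - p))         # odds of an error divided by b


def expected_first_success(p1, step, max_trials=500):
    """Expected trial of the first left turn when every right turn is unrewarded.

    step maps p_n to p_{n+1} after a right/nonreward trial.
    """
    expected, all_errors_so_far, p = 0.0, 1.0, p1
    for trial in range(1, max_trials + 1):
        expected += trial * all_errors_so_far * (1.0 - p)  # first success on this trial
        all_errors_so_far *= p
        p = step(p)
    return expected


if __name__ == "__main__":
    alpha = expected_first_success(1.0, lambda p: alpha_update(p, "R", False, 0.858, 0.955))
    beta = expected_first_success(0.97, lambda p: beta_update(p, "R", False, 0.952, 0.642))
    print(round(alpha, 2), round(beta, 2))
```

Only the right/nonreward operator enters this particular statistic, because every trial before the first success is an unrewarded right turn in the 100:0 condition.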

Method¹

Subjects. — The Ss were male hooded rats of the Long-Evans strain, from Rockland Farms, New York City, New York. They weighed about 75 gm. on arrival. Eight rats were used for the preliminary experiment. In the main experiment 63 rats were used, but the final N = 50, because 13 died during the experiment.

¹ For details see Fey, 1960.


TABLE 1

Period 2, 100:0 Group

Comparison of Statistics from the First 35 Trials of Experimental Group with Corresponding Model Values Calculated with p1 = 1, α1 = .858, and α2 = .955 for Alpha Model and p1 = .97, β1 = .952, and β2 = .642 for Beta Model

                                      Means                     Standard Errors
Statistic                   Exp.    α Model   β Model     Exp.    α Model   β Model
Number of Ss                25      100       500
Number of trials            35      35        35
Total number of errors      12.28   12.28     12.39       .76     .00       .012
Trial of last error         23.16   22.10     26.45       1.00    .25       .028
Trial of first success      6.88    6.87      6.49        .48     .00       .024
Number of RR sequences      7.48    7.32      7.01        .56     .00       .014
          RL                4.76    4.85      5.32        .28     .09       .016
          LR                3.80    3.85      4.41        .28     .10       .014
          LL                17.96   17.98     17.26       .80     .22       .024
Number of L runs of:
  Length 1                  2.00    1.81      1.86        .20     .10       .012
         2                  .56     .81       .87         .08     .03       .014
         3                  .44     .51       .57         .08     .03       .006
         4                  .32     .37       .38         .04     .04       .006
         5                  .08     .28       .30         .04     .02       .004
Number of R runs of:
  Length 1                  2.60    2.70      3.12        .24     .00       .014
         2                  .80     .81       .83         .16     .00       .008
         3                  .40     .40       .36         .08     .00       .006
         4                  .24     .27       .28         .04     .00       .008
         5                  .20     .20       .20         .04     .00       .004
Total number of R runs      4.80    4.96      5.38        .28     .00       .018

Note. — Standard error of the mean was computed from range approximation.

TABLE 2

Period 2, 75:25 Group

Comparison of Statistics of 75:25 Experimental Group with Corresponding Model Values Calculated with p1 = 1, α1 = .858, and α2 = .955 for Alpha Model and p1 = .97, β1 = .952, and β2 = .642 for Beta Model

                                      Means                     Standard Errors
Statistic                   Exp.    α Model   β Model     Exp.    α Model   β Model
Number of animals           25      100       200
Total number of errors      15.92   19.21     19.43       1.04    .22       .03
Trial of last error         27.32   32.71     33.59       1.04    .21       .045
Trial of first success      7.52    7.19      7.56        .64     .16       .07
Number of RR sequences      10.24   11.85     11.81       .88     .25       .045
          RL                5.44    7.09      7.33        .28     .10       .035
          LR                4.68    6.37      6.75        .32     .09       .035
          LL                13.64   8.69      8.11        1.12    .24       .08
Number of L runs of:
  Length 1                  2.48    3.63      3.55        .20     .10       .045
         2                  1.16    1.62      1.68        .20     .05       .03
         3                  .40     .77       1.09        .08     .03       .02
         4                  .24     .41       .52         .04     .02       .015
         5                  .16     .23       .21         .04     .02       .01
Number of R runs of:
  Length 1                  2.88    3.52      3.70        .20     .10       .035
         2                  1.16    1.50      1.63        .12     .05       .03
         3                  .48     .66       .79         .12     .04       .015
         4                  .36     .54       .47         .08     .03       .02
         5                  .20     .38       .29         .08     .02       .01
Total number of R runs      5.68    7.36      7.62        .32     .09       .035

Note. — The model parameters were estimated from the 100:0 group. Standard error was computed from range approximation.



Apparatus. The T maze was a replica of that used by Galanter and Bush (1959). It consisted of a straight alley runway for pretraining and a T maze for the main experiment. The T maze was built in such a way that the crossbar and the start arm of the T could be separated and a goalbox could be hooked to the stem of the T, thereby changing the maze into a straight runway. The maze was built of plywood with a removable wire mesh top and pressed wood doors. The inside of the stem and the attachable goalbox were painted medium gray, the right arm was painted light gray, and the left, dark gray. The length of the cross arm was 60 in., the length of the stem was 26 in., and the attachable goalbox was 10 in. The alleys were 4 in. wide and the walls were 8 in. high. The starting compartment was 10 in. long with a guillotine door on the maze side and a hinged door on the outside. Another guillotine door was at the choice point. The goal cups were placed at the end of each arm. The metal goal cups had double floors: the bottom part contained inaccessible wet food mash to balance olfactory cues, and the top contained the reward pellet.

Procedure. This experiment consisted of three parts: (a) preliminary handling; (b) straight alley pretraining; and (c) T-maze learning.

The Ss were kept in the laboratory for 23 days at ad lib. food and water and were handled daily. Then Ss were deprived of
food  for  18,  21,  21i  21 1,  and  22  hr.  on  Days 
24,  25,  26,  27,  and  28,  respectively. 

The pretraining started on Day 29. For the remainder of the experiment Ss were under 18 hr. food deprivation at the beginning of each daily run. They were fed 4 hr. later for a 2-hr. period. Water was always available in the cages.

The Ss were given one trial per day of pretraining on the straight alley runway. Pretraining lasted for three days.

During the 30 days of Period 1 of the T-maze learning, the following procedure was adhered to: A .038-gm. pellet was deposited in the right goal cup; nothing was placed in the left goal cup. The S was placed in the startbox and the startbox door was raised. As S passed the choice point, its door was lowered. The S was left in the maze until it ate the pellet, until it investigated the goal cup (on the nonrewarded side), or until 3 min. were up, whichever occurred first.

At the end of Period 1 Ss were divided at random into two groups. One group was always rewarded on the left side during Period 2, and the other was rewarded according to the following schedule obtained from a random number table with P(L) = 0.75: L L L R L L R L L R L L L L L R L R L L L L L R L L R R R R L L L L L. Period 2 lasted for 35 days.
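
As a quick check on the schedule quoted above, the following sketch tallies the sequence as transcribed here. The names `SCHEDULE`, `tally`, and `longest_run` are illustrative and not part of the original report; the sequence itself is copied from the text, so any transcription error in the scan would carry over.

```python
from collections import Counter
from itertools import groupby

# Reward schedule for the 75:25 group, as transcribed from the Procedure text.
SCHEDULE = "LLLR" + "LLRLLRLLLLLRLRLLLLLR" + "LLRRRRLLLLL"


def tally(schedule):
    """Return the number of days and the count and proportion of L-baited days."""
    counts = Counter(schedule)
    days = len(schedule)
    return {
        "days": days,
        "L": counts["L"],
        "R": counts["R"],
        "prop_L": counts["L"] / days,
    }


def longest_run(schedule, side):
    """Length of the longest block of consecutive days baited on `side`."""
    return max(
        (len(list(group)) for key, group in groupby(schedule) if key == side),
        default=0,
    )


if __name__ == "__main__":
    print(tally(SCHEDULE))
    print(longest_run(SCHEDULE, "L"), longest_run(SCHEDULE, "R"))
```

On this transcription the sequence has 35 entries, matching the stated length of Period 2, with 25 left-baited and 10 right-baited days, so the realized proportion of L is about .71 rather than the nominal .75.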

Estimation of parameters. The parameters of the alpha model were estimated in the following way: The initial probability $p_1$ was taken to be 1.00. The other two parameters were estimated from the Period 2 data of the 100:0 group by equating the observed mean number of trials before the first success and the observed mean total number of errors to their respective expected values.

Initial estimates of the beta model parameters were determined by methods similar to those used for finding the alpha model parameters. These estimates were modified by exploration of the parameter space until the response probabilities (Monte Carlo computations) were similar to the experimental data of the 100:0 group. The following criteria were used: the total number of errors generated by the model had to match the data, and a plot of trial-by-trial mean response probabilities produced by the model had to appear similar to the corresponding plot of the data.

Results 

The results of the experiment are summarized in Fig. 1 and 2 and Tables 1 and 2. Figure 1 presents the proportions of R response of the 100:0 group during Period 2 and the corresponding curves generated by the models. Figure 2 depicts the same data for the 75:25 group during Period 2. Tables 2 and 1 give comparative results of this experiment and corresponding model values. A more detailed analysis of results is presented by Fey (1960).

Fig. 1. Period 2, Group 100:0. Trial by trial proportions of L responses made by 25 experimental Ss (filled circles); generated by alpha model (smooth line) computed with $p_1 = 1.00$, $\alpha_1 = 0.858$, and $\alpha_2 = 0.955$; and generated by 500 beta model Monte Carlo analogs (open circles) computed with $p_1 = 0.97$, $\beta_1 = 0.952$, and $\beta_2 = 0.647$.

Fig. 2. Period 2, Group 75:25. Trial by trial proportions of L responses made by 25 experimental Ss (filled circles); by 100 alpha model Monte Carlo analogs (open circles) computed with $p_1 = 1.00$, $\alpha_1 = 0.858$, and $\alpha_2 = 0.955$; and by 200 beta model Monte Carlo analogs (triangles) computed with $p_1 = 0.97$, $\beta_1 = 0.952$, and $\beta_2 = 0.647$. (R = food reward is in right maze arm, otherwise the left arm is baited.)

Discussion 

The merit of a mathematical model of learning lies not so much in describing the data of any one experiment with the aid of parameters estimated from that particular experiment as in its ability to represent accurately the learning process of a variety of different experimental situations using the same set of parameters. In other words, once the parameters are estimated for one experimental situation the model should be able to predict the course of learning in other experiments. Models which will handle a variety of experimental situations with the same set of parameters are called parameter invariant.

This experiment indicates that the models under consideration fit the Period 2, 100:0 group data, from which their parameters were estimated, quite well, but the fit to the Period 2, 75:25 group data (using parameters estimated from the Period 2, 100:0 group) is less successful. Both models show an apparent lack of parameter invariance of approximately equal magnitude.

Tables 1 and 2 might give the impression that the alpha model fits the data slightly better than the beta model. This conclusion is hardly warranted if the magnitudes of the differences and the methods of estimating the parameters are considered. The alpha model parameters were determined analytically; those of the beta model were estimated by Monte Carlo procedures. Thus the alpha model parameters were determined more exactly than those of the beta model.

The lack of long runs seems to be a basic difficulty of the models. This is of little consequence in 100:0 animal learning, but it does seem to be important in partial reinforcement schedules for animals as well as in human choice behavior (Derks, 1960). This lack of long runs is not generally manifested in mean learning curves, but only in a sequential analysis of the data.

The fact that the 75:25 "stat rats" learn more slowly than the experimental Ss is not as serious as the lack of long runs. A change in the size of the model parameters will correct the former deficiency. The data indicate that by reducing the beta model parameters by about 25%, the total number of errors made by the model analogs will match those of the experimental Ss for the 75:25 group. These reduced parameters decrease the fit to the 100:0 group.

The  slow  learning  of  the  75:25  model 
analogs  could  be  handled  by  specifying 
the  manner  in  which  the  parameters  are 
modified  when  the  schedule  changes  from 
100:0  to  75:25.  With  respect  to  the 
lack  of  perseverance,  no  small  change  in 
parameter  values  would  increase  the 
fit  of  model  to  data. 

Galanter and Bush (1959) noted in three of their experiments that the probability of turning to the more frequently rewarded side tended to decrease slightly during the first few acquisition trials before it began to rise. This phenomenon occurs also in other experimental situations (Gibson & Walk, 1956; Jensen, 1960; Kendler & Lachman, 1958). In the present experiment, the initial dip is hardly noticeable.

Fig. 3. Trial by trial distribution of time spent in baited, left (filled circles) and unbaited, right (open circles) maze arm by 20 Ss of Galanter and Bush (1959) Exp. III.

Fig. 4. Trial by trial distribution of time spent in baited, left (filled circles) and unbaited, right (open circles) maze arm by 50 Ss of Period 1.

A look at the time the rats spent in the baited and in the unbaited arms of the maze (Fig. 3 and 4) indicates that initially our Ss and those of the Galanter and Bush (1959) Exp. III* were removed more quickly from the unbaited than from the baited side of the maze; later in the experiment, removal occurred after approximately the same time interval in either arm of the maze. The reason for this is found in the criteria for removing S from the maze: S is left in the maze until it investigates the food cup on the unbaited side, until it eats the pellet on the baited side, or until 3 min. are up, whichever occurs first. The Ss investigate the food cup on the baited side before they start eating the pellet. In fact, S may take a pellet in its mouth, drop it, and not eat it; thus investigation of the food cup occurs before the eating (Fig. 3 and 4).

* The data plotted in Fig. 3 were obtained from the original protocols of the experiment reported by Galanter and Bush (1959).

Should S initially prefer removal from the maze to eating the pellet, the nonrewarded side may actually be more attractive to S than the rewarded side, since S has to stay in the nonrewarded side for a shorter period of time. The Ss behave initially as if they were much more interested in escaping from the maze than in eating. As S becomes accustomed to the experimental situation, interest in escaping decreases and the food pellet gradually becomes more attractive.

This explanation can handle the dip in our experiment, but it fails in the case of other experiments such as a Skinner-box situation.

Summary 

This paper investigated two models for learning: a linear model proposed by Bush and Mosteller and a nonlinear model developed by Luce. Specifically, an attempt was made to determine whether data obtained from two different experimental situations could be described by the models without changing the parameters.

Two groups of rats were trained in a T maze. One group was always rewarded with food on one side; the other group received a food reward with probability .75 on one side and .25 on the other side. Model statistics were computed for both groups, using parameters estimated from the group that was always rewarded on the same side, and compared with the experimental data.

It was found that there is good agreement between the models and the data of the continuously reinforced group, from which the model parameters were estimated. The fit to the data of the partially reinforced group, however, leaves something to be desired.

Both models fit the data about equally well.

A FUNCTIONAL EQUATION ANALYSIS OF TWO LEARNING MODELS*

LAVEEN KANAL†

GENERAL DYNAMICS/ELECTRONICS

ROCHESTER, NEW YORK

One-absorbing barrier random walks arising from Luce's nonlinear beta model for learning and a linear commuting-operator model (called the alpha model) are considered. Functional equations for various statistics are derived from the branching processes defined by the two models. Solutions to general functional equations, satisfied by statistics of the alpha and beta models, are obtained. The methods presented have application to other learning models.

The two-response, two-event, path-independent, contingent version of a number of stochastic models for learning is given by the equations

$$p_{n+1} = \begin{cases} Q_1 p_n & \text{with probability } p_n \\ Q_2 p_n & \text{with probability } (1 - p_n), \end{cases} \tag{1}$$

where $Q_1$ and $Q_2$ represent transition operators, and $p_n$ and $(1 - p_n)$ are, respectively, the probabilities of responses $A_1$ and $A_2$ on trial $n$. A linear model discussed by Bush and Mosteller [8] is obtained when the operators in (1) are defined by the equations

$$Q_1 p_n = \alpha_1 p_n \quad (0 < \alpha_1 < 1), \qquad Q_2 p_n = \alpha_2 p_n \quad (0 < \alpha_2 < 1). \tag{2}$$

In this paper, this linear model is called the "alpha" model. A specialization of the nonlinear "beta" model proposed by Luce [13] is obtained when the operators are defined by the equations:

$$Q_i p_n = \frac{\beta_i p_n}{1 + (\beta_i - 1) p_n}, \qquad i = 1, 2; \quad \beta_i > 0; \quad 0 \neq p_n \neq 1. \tag{3}$$

In terms of the variable $v_n = p_n/(1 - p_n)$ the transition equations for this version of the beta model are

$$v_{n+1} = \begin{cases} \beta_1 v_n & \text{with probability } p_n \\ \beta_2 v_n & \text{with probability } (1 - p_n), \end{cases} \tag{4}$$

where $0 < v < \infty$; $\beta_i > 0$; $i = 1, 2$.

*Abstracted from portions of the author's doctoral dissertation, University of Pennsylvania, June 1960. The author is indebted to Robert R. Bush, his dissertation supervisor, for the valuable help and encouragement received from him and to R. Duncan Luce for many helpful discussions and for partial support from an NSF grant.

†Formerly at the Moore School of Electrical Engineering, University of Pennsylvania, Philadelphia, Pa. The author is grateful to J. G. Brainerd, S. Gorn, and C. N. Weygandt of the Moore School, and N. F. Finkelstein, D. Parkhill and A. A. Wolf of General Dynamics for their encouragement.

This article appeared in Psychometrika, 1962, 27, 89-104. Reprinted with permission.

In the beta model response probabilities undergo nonlinear rather than linear transformations from trial to trial. Since the probabilities of choice inevitably enter into the derivation of stochastic properties of the model, the methods generally used to derive properties of linear learning models do not apply to the beta model.

Analytical methods applicable to both the alpha and beta models are presented in this paper. The approach used is to consider the branching process defined by the decision rules of the two models, and from it to formulate functional equations for various statistics of interest. Tatsuoka and Mosteller [15] used a functional equation approach to obtain some statistics for the alpha model. Their techniques differ somewhat from those presented here; the approach developed here leads to a unified method of attack for the alpha and beta models and can be extended to others.
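
The relation between the probability form of the beta operator and its odds form can be checked numerically. The sketch below is only an illustration: the function names `alpha_op`, `beta_op`, `odds`, and `from_odds` are hypothetical, and the parameter values are arbitrary. It shows that applying the beta operator to $p$ multiplies the odds $v = p/(1 - p)$ by $\beta_i$, and that the operators of each model commute.

```python
def alpha_op(p, alpha):
    """Linear (alpha model) operator: Q_i p = alpha_i * p."""
    return alpha * p


def beta_op(p, beta):
    """Nonlinear (beta model) operator: Q_i p = beta_i p / (1 + (beta_i - 1) p)."""
    return beta * p / (1.0 + (beta - 1.0) * p)


def odds(p):
    """v = p / (1 - p), defined for 0 <= p < 1."""
    return p / (1.0 - p)


def from_odds(v):
    """Inverse of odds(): p = v / (1 + v)."""
    return v / (1.0 + v)


if __name__ == "__main__":
    p, b1, b2 = 0.3, 0.5, 0.8          # arbitrary illustrative values
    print(odds(beta_op(p, b1)), b1 * odds(p))                        # same number
    print(beta_op(beta_op(p, b1), b2), beta_op(beta_op(p, b2), b1))  # order is irrelevant
```

Because both models have commuting operators, the state after any sequence of responses depends only on how many $A_1$ and $A_2$ responses have occurred, not on their order.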

Some  Random  Walks  Arising  from  the  Beta  Model 

In (4), $\beta_i > 1$ and $\beta_i < 1$ may be identified, respectively, with reward and nonreward of the response. If response $A_1$ is never rewarded and response $A_2$ is always rewarded, $\beta_1 < 1$, $\beta_2 < 1$. If both responses are always rewarded, $\beta_1 > 1$, $\beta_2 < 1$. If neither response is ever rewarded, $\beta_1 < 1$, $\beta_2 > 1$. It is shown in [11] that these three cases lead to one-absorbing-barrier (OAB), two-absorbing-barrier (TAB), and two-reflecting-barrier (TRB) walks. Rigorous proof of the nature of the barriers for these and other random walks resulting from the two-alternative, two-outcome beta model is given by Lamperti and Suppes [12]. Only the OAB beta model ($\beta_1 < 1$, $\beta_2 < 1$) is considered in this paper. Except for the case when $\alpha_i = 1$, in the alpha model either response diminishes the probability of response $A_1$; the alpha model is a one-absorbing-barrier model.

Functional Equations for Statistics of the One-Absorbing Barrier Models

The OAB alpha and beta models lead to an asymptotic distribution of $p_n$ which has all its density at $p = 0$. (Considering response $A_1$ as an error on the part of organisms which are learning, this means that all organisms eventually learn not to make errors.) Additional information about the processes is obtained from various statistics. Following the work of Bush and Sternberg [9] on a simple single-operator model, the statistics considered are those which describe the rate of approach to the asymptote, such as the mean, weighted mean, and variance of the rate of approach; sequential statistics concerning runs of responses; other statistics, such as those describing the first occurrence of an $A_2$ response (success) and the last occurrence of an $A_1$ response (failure). Functional equations satisfied by these statistics are derived by considering the branching processes shown in Fig. 1.

[Fig. 1 (a): The Beta Model Lattice]

For the analysis which follows, a sequence $x_1, x_2, \cdots, x_n$ of random variables is defined such that

$$x_n = \begin{cases} 1 & \text{if response } A_1 \text{ occurs on trial } n \\ 0 & \text{if response } A_2 \text{ occurs on trial } n. \end{cases}$$

The random variables have expectations $p_n$.

The mean number of $A_1$ responses

In terms of the random variables $x_n$, the total number of $A_1$ responses in $N$ trials is given by the random variable $X_N = \sum_{n=1}^{N} x_n$ with expectation $E(X_N) = \sum_{n=1}^{N} p_n$. In the one-absorbing barrier models, both responses decrease the probability of response $A_1$, and

$$E(X) = \lim_{N \to \infty} E(X_N)$$

is of interest. In fact, by replacing the parameters of the models by $\beta = \max(\beta_1, \beta_2)$ and $\alpha = \max(\alpha_1, \alpha_2)$ finite upper bounds for $E(X)$ in the two models are obtained. Now the number $X_N$ of $A_1$ responses in $N$ trials starting from trial 1 will be equal to the number, $X_{N-1}$, of $A_1$ responses in $(N - 1)$ trials starting from trial 2 if the result of trial 1 is an $A_2$ response and be equal to $1 + X_{N-1}$ if the result of trial 1 is an $A_1$ response. Letting $\phi$ denote the expected number of $A_1$ responses, the functional equations for $\phi$ are obtained from Fig. 1 to be

$$\begin{aligned}
\phi_\beta(v, N) &= p_1[1 + \phi_\beta(\beta_1 v, N - 1)] + (1 - p_1)\phi_\beta(\beta_2 v, N - 1) \\
&= \frac{v}{1 + v}[\phi_\beta(\beta_1 v, N - 1) + 1] + \frac{1}{1 + v}\phi_\beta(\beta_2 v, N - 1),
\end{aligned}$$

and

$$\phi_\alpha(p, N) = p[\phi_\alpha(\alpha_1 p, N - 1) + 1] + (1 - p)\phi_\alpha(\alpha_2 p, N - 1).$$

When $N \to \infty$ these equations become

$$\phi_\beta(v) = \frac{v}{1 + v}\phi_\beta(\beta_1 v) + \frac{1}{1 + v}\phi_\beta(\beta_2 v) + \frac{v}{1 + v}, \tag{5}$$

$$\phi_\alpha(p) = p\,\phi_\alpha(\alpha_1 p) + (1 - p)\,\phi_\alpha(\alpha_2 p) + p. \tag{6}$$

Both the above functions must, of course, satisfy the boundary condition $\phi(0) = 0$.
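
One way to evaluate the expected number of $A_1$ responses numerically is to sum $E(p_n)$ over trials, using the fact that the operators commute, so that the state after $i$ $A_1$ responses and $j$ $A_2$ responses depends only on $(i, j)$. The sketch below is an illustration under that reading, not the author's procedure; the function names, the truncation at `n_trials`, and the parameter values are assumptions made for the example.

```python
def expected_a1(prob_after, n_trials):
    """Expected number of A1 responses in the first n_trials trials.

    prob_after(i, j) must return P(A1) after i A1 and j A2 responses.
    """
    layer = {0: 1.0}   # layer[i] = P(exactly i A1 responses so far)
    total = 0.0
    for n in range(n_trials):
        nxt = {}
        for i, w in layer.items():
            p = prob_after(i, n - i)
            total += w * p                                   # contributes E(p_n)
            nxt[i + 1] = nxt.get(i + 1, 0.0) + w * p         # A1 occurs
            nxt[i] = nxt.get(i, 0.0) + w * (1.0 - p)         # A2 occurs
        layer = nxt
    return total


def phi_alpha(p, a1, a2, n_trials=400):
    """Truncated E(X) for the alpha model started at probability p."""
    return expected_a1(lambda i, j: p * a1**i * a2**j, n_trials)


def phi_beta(v, b1, b2, n_trials=400):
    """Truncated E(X) for the beta model started at odds v."""
    def prob(i, j):
        u = v * b1**i * b2**j
        return u / (1.0 + u)
    return expected_a1(prob, n_trials)


if __name__ == "__main__":
    p, a1, a2 = 0.7, 0.8, 0.9          # arbitrary illustrative values
    lhs = phi_alpha(p, a1, a2)
    rhs = p * phi_alpha(a1 * p, a1, a2) + (1 - p) * phi_alpha(a2 * p, a1, a2) + p
    print(lhs, rhs)                    # the two sides of (6) should agree closely
```

For parameters well below 1 the truncation error is negligible after a few hundred trials, so the two sides of (5) and (6) can be compared directly as a sanity check on a reconstruction or an implementation.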

The second moment of the number of $A_1$ responses

Letting $\theta$ denote $E(X^2)$ the functional equations for the second moment are then, as $N \to \infty$,

$$\theta_\beta(v) = \frac{v}{1 + v}\theta_\beta(\beta_1 v) + \frac{1}{1 + v}\theta_\beta(\beta_2 v) + \frac{v}{1 + v}[1 + 2\phi_\beta(\beta_1 v)], \tag{7}$$

$$\theta_\alpha(p) = p\,\theta_\alpha(\alpha_1 p) + (1 - p)\,\theta_\alpha(\alpha_2 p) + p[1 + 2\phi_\alpha(\alpha_1 p)]. \tag{8}$$

$\theta(0) = 0$ is a boundary condition. Finite upper bounds exist for $\theta_\beta(v)$ and $\theta_\alpha(p)$; replacing the parameters by $\beta = \max(\beta_1, \beta_2)$ and $\alpha = \max(\alpha_1, \alpha_2)$ the variance of $X_N$ is $\sum_{n=1}^{N} p_n(1 - p_n)$ which remains finite as $N \to \infty$ if $\sum_{n=1}^{\infty} p_n$ does. Functional equations for higher moments are easily obtained in this manner.

The functional equations for the mean and second moment of the number of $A_1$ responses have been previously obtained by Tatsuoka [14] and Tatsuoka and Mosteller [15]. Their method of derivation is somewhat different from that presented here.
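
The second moment can be checked in the same spirit by computing the full distribution of the number of $A_1$ responses after a large number of trials. The following sketch does this for the alpha model only; the function names and parameter values are illustrative assumptions, and the truncation at `n_trials` stands in for the limit $N \to \infty$.

```python
def a1_count_distribution(p, a1, a2, n_trials):
    """Return {i: P(X_N = i)} for the alpha model after n_trials trials."""
    layer = {0: 1.0}
    for n in range(n_trials):
        nxt = {}
        for i, w in layer.items():
            q = p * a1**i * a2**(n - i)      # P(A1) after i A1 and n - i A2 responses
            nxt[i + 1] = nxt.get(i + 1, 0.0) + w * q
            nxt[i] = nxt.get(i, 0.0) + w * (1.0 - q)
        layer = nxt
    return layer


def moments(p, a1, a2, n_trials=400):
    """Return (mean, second moment) of the number of A1 responses."""
    dist = a1_count_distribution(p, a1, a2, n_trials)
    mean = sum(i * w for i, w in dist.items())
    second = sum(i * i * w for i, w in dist.items())
    return mean, second


if __name__ == "__main__":
    p, a1, a2 = 0.7, 0.8, 0.9              # arbitrary illustrative values
    phi, theta = moments(p, a1, a2)
    phi1, theta1 = moments(a1 * p, a1, a2)
    _, theta2 = moments(a2 * p, a1, a2)
    rhs = p * theta1 + (1 - p) * theta2 + p * (1 + 2 * phi1)
    print(theta, rhs)                      # the two sides of (8) should agree closely
    print(theta - phi**2)                  # variance of the number of A1 responses
```

When $\alpha_1 = \alpha_2 = \alpha$ the trials are independent, and the variance reduces to $\sum p_n(1 - p_n)$ with $p_n = \alpha^{n-1} p$, which gives a closed-form value to test against.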

The weighted number of $A_1$ responses

Define the random variable

$$Y_{0,N} = \sum_{n=1}^{N} n\,x_n.$$

Then $Y_{0,N}$ represents the weighted number of $A_1$ responses in $N$ trials with the weighting function being the trial number $n$. From trial 2 on, the weighted number of $A_1$ responses is $\sum_{n=2}^{N} n\,x_n$, which by relabeling the random variables $x_2, x_3, \cdots$ as $x_1, x_2, \cdots$ can be represented by the random variable

$$Y_{1,N-1} = \sum_{n=1}^{N-1} (n + 1)\,x_n.$$

If $\psi$ stands for the expectation of the weighted number of $A_1$ responses, the functional equations are obtained by noting that $Y_{0,N}$ is equal to $Y_{1,N-1}$ if the result of the first trial is an $A_2$ response and is equal to $(1 + Y_{1,(N-1)})$ if the result of the first trial is an $A_1$ response. For an infinite number of trials,

$$\psi_\beta(v) = \frac{v}{1 + v}\psi_\beta(\beta_1 v) + \frac{1}{1 + v}\psi_\beta(\beta_2 v) + \phi_\beta(v), \tag{9}$$

$$\psi_\alpha(p) = p\,\psi_\alpha(\alpha_1 p) + (1 - p)\,\psi_\alpha(\alpha_2 p) + \phi_\alpha(p). \tag{10}$$

A boundary condition is $\psi(0) = 0$.

Number of trials before the first $A_2$ response (success) occurs

Let $F_1 + 1$ denote the trial number on which response $A_2$ occurs for the first time so that $F_1$ is the number of trials before the first $A_2$. $F_1$ is equal to zero if $A_2$ occurs on the first trial and is equal to $(1 + F_2)$, where $F_2$ denotes the number of trials, before the first $A_2$ response occurs, starting at trial 2, if trial 1 results in an $A_1$ response. Letting $\nu$ denote the expectation of the random variables $F$, the functional equations for $\nu$ are

$$\nu_\beta(v) = p_1[\nu_\beta(\beta_1 v) + 1] + (1 - p_1)[0] = \frac{v}{1 + v}\nu_\beta(\beta_1 v) + \frac{v}{1 + v}, \tag{11}$$

$$\nu_\alpha(p) = p\,\nu_\alpha(\alpha_1 p) + p. \tag{12}$$

If $\rho$ denotes the second moment of the random variables $F$, the functional equations for $\rho$ are

$$\rho_\beta(v) = \frac{v}{1 + v}[1 + 2\nu_\beta(\beta_1 v) + \rho_\beta(\beta_1 v)], \tag{13}$$

$$\rho_\alpha(p) = p\,\rho_\alpha(\alpha_1 p) + p[1 + 2\nu_\alpha(\alpha_1 p)]. \tag{14}$$
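
Because only the $A_1$ operator acts before the first success, the distribution of $F$ in the alpha model can be written down directly: $F = k$ when the first $k$ trials are $A_1$ responses and trial $k + 1$ is an $A_2$ response. The sketch below sums that distribution; the function name `first_success_moments`, the stopping tolerance, and the parameter values are assumptions made for illustration.

```python
def first_success_moments(p, a1, tol=1e-15, max_terms=10_000):
    """Mean and second moment of F (trials before the first A2), alpha model."""
    mean = 0.0
    second = 0.0
    reach = 1.0                       # P(first k trials are all A1), starting at k = 0
    for k in range(max_terms):
        q = p * a1**k                 # P(A1) on trial k + 1 after k A1 responses
        prob_k = reach * (1.0 - q)    # P(F = k)
        mean += k * prob_k
        second += k * k * prob_k
        reach *= q
        if reach < tol:               # remaining probability mass is negligible
            break
    return mean, second


if __name__ == "__main__":
    p, a1 = 0.9, 0.8                  # arbitrary illustrative values
    nu, rho = first_success_moments(p, a1)
    nu1, rho1 = first_success_moments(a1 * p, a1)
    print(nu, p * nu1 + p)                     # the two sides of (12)
    print(rho, p * rho1 + p * (1 + 2 * nu1))   # the two sides of (14)
```

With $\alpha_1 = 1$ the distribution of $F$ is geometric, which provides an independent closed-form check on the summation.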

Trial number at which last $A_1$ response occurs

Let

$$L_n = \begin{cases} 0 & \text{if no } A_1 \text{ response occurs on, or after trial } n \\ 1 & \text{if the last } A_1 \text{ response occurs on trial } n \\ (N + 1) - n & \text{if the last } A_1 \text{ response occurs on trial } N > n. \end{cases}$$

Then the random variable $L_1$ represents the trial number at which the last $A_1$ response occurs, and by definition $L_1$ is zero if no $A_1$ response occurs on any trial. In the following development, the sequence of responses $A_2 A_2 A_1$ denotes the occurrence of $A_2$ on the first trial followed by $A_2$ on the second trial and by $A_1$ on the third trial. It is evident that

$$L_1 = \begin{cases} L_2 + 1 & \text{if } A_1 \text{ occurs} \\ L_3 + 2 & \text{if } A_2 A_1 \text{ occurs} \\ L_4 + 3 & \text{if } A_2 A_2 A_1 \text{ occurs} \\ L_5 + 4 & \text{if } A_2 A_2 A_2 A_1 \text{ occurs} \end{cases}$$

and so on.

Letting $\mu$ denote the expectation of the random variables $L$, the functional equation for $\mu_\alpha$ is developed from Fig. 1. For an infinite number of trials

$$\begin{aligned}
\mu_\alpha(p) &= p[\mu_\alpha(\alpha_1 p) + 1] + (1 - p)\alpha_2 p[\mu_\alpha(\alpha_1 \alpha_2 p) + 2] \\
&\quad + (1 - p)(1 - \alpha_2 p)\alpha_2^2 p[\mu_\alpha(\alpha_1 \alpha_2^2 p) + 3] + \cdots \\
&= p\,\mu_\alpha(\alpha_1 p) + p + (1 - p)\alpha_2 p + (1 - p)(1 - \alpha_2 p)\alpha_2^2 p + \cdots \\
&\quad + [(1 - p)\alpha_2 p + 2(1 - p)(1 - \alpha_2 p)\alpha_2^2 p \\
&\quad + 3(1 - p)(1 - \alpha_2 p)(1 - \alpha_2^2 p)\alpha_2^3 p + \cdots + (1 - p)\alpha_2 p\,\mu_\alpha(\alpha_1 \alpha_2 p) \\
&\quad + (1 - p)(1 - \alpha_2 p)\alpha_2^2 p\,\mu_\alpha(\alpha_1 \alpha_2^2 p) + \cdots].
\end{aligned}$$

But the term in brackets in the last expression is just $(1 - p)\mu_\alpha(\alpha_2 p)$ as may be deduced from the expression for $\mu_\alpha(p)$. Also

$$p + (1 - p)\alpha_2 p + (1 - p)(1 - \alpha_2 p)\alpha_2^2 p + \cdots = 1 - \prod_{i=0}^{\infty} (1 - \alpha_2^i p).$$

A similar development for $\mu_\beta(v)$ results in the functional equations

$$\mu_\beta(v) = \frac{v}{1 + v}\mu_\beta(\beta_1 v) + \frac{1}{1 + v}\mu_\beta(\beta_2 v) + 1 - \prod_{i=0}^{\infty} \frac{1}{1 + \beta_2^i v}, \tag{15}$$

$$\mu_\alpha(p) = p\,\mu_\alpha(\alpha_1 p) + (1 - p)\,\mu_\alpha(\alpha_2 p) + 1 - \prod_{i=0}^{\infty} (1 - \alpha_2^i p), \tag{16}$$

with $\mu(0) = 0$, since for $p = 0$ no $A_1$ response ever occurs. A different derivation for the mean of the trial number at which the last $A_1$ response occurs in the alpha model will be found in Tatsuoka and Mosteller [15].

For the expectation of $L_1^2$ it is necessary to consider the expectations of $(L_2 + 1)^2$, etc. Denoting the second moment of the random variables $L$ by $\gamma$ the functional equations for $\gamma$ are

$$\gamma_\beta(v) = \frac{v}{1 + v}\gamma_\beta(\beta_1 v) + \frac{1}{1 + v}\gamma_\beta(\beta_2 v) + \left[2\mu_\beta(v) + \prod_{i=0}^{\infty} \frac{1}{1 + \beta_2^i v} - 1\right], \tag{17}$$

and

$$\gamma_\alpha(p) = p\,\gamma_\alpha(\alpha_1 p) + (1 - p)\,\gamma_\alpha(\alpha_2 p) + \left[2\mu_\alpha(p) + \prod_{i=0}^{\infty} (1 - \alpha_2^i p) - 1\right], \tag{18}$$

with $\gamma(0) = 0$. Functional equations for higher moments of $L_1$ can easily be generated in the above manner.

Number of runs, of length $j$, of $A_1$ responses

The sequence of responses

$$\underbrace{A_1 A_1 \cdots A_1}_{j \text{ trials}} A_2$$

is termed a run, of $A_1$ responses, of length $j$. Statistics concerning the number of runs of $A_1$ responses of length exactly equal to $j$, and of length greater than or equal to $j$ $(j = 1, 2, \cdots)$, are of interest. Let $R_{n,j}$ denote the number of runs of length $j$, which occur between trial $n$ and the termination of the process. The total number of runs of length $j$ is then $R_{1,j}$. From the branching process of Fig. 1 it is seen that $R_{1,j} = R_{n,j} + \delta_{n,j+2}$, where $\delta_{n,j+2}$ is the Kronecker delta function. Letting $\sigma_j$ denote the expectation of the number of runs of length $j$, the functional equation for $\sigma_{j\beta}$ is developed from the beta model lattice of Fig. 1 (a). For an infinite number of trials

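where $\delta_{k,j}$ is the Kronecker delta function. Substituting $\sigma_{j\beta}(\beta_1 v)$ for part of the expression gives,

$$\sigma_{j\beta}(v) = p_1 \sigma_{j\beta}(\beta_1 v) + (1 - p_1)\sigma_{j\beta}(\beta_2 v) + \prod_{i=1}^{j} p_i\,[(1 - p_{j+1}) - p_{j+1}(1 - p_{j+2})].$$

A similar development gives the functional equation for $\sigma_{j\alpha}(p)$. The functional equations are

$$\sigma_{j\beta}(v) = \frac{v}{1 + v}\sigma_{j\beta}(\beta_1 v) + \frac{1}{1 + v}\sigma_{j\beta}(\beta_2 v) + \left[\prod_{k=1}^{j} \frac{\beta_1^{k-1} v}{1 + \beta_1^{k-1} v}\right]\left[\frac{1}{1 + \beta_1^{j} v} - \frac{\beta_1^{j} v}{(1 + \beta_1^{j} v)(1 + \beta_1^{j+1} v)}\right], \tag{19}$$

$$\sigma_{j\alpha}(p) = p\,\sigma_{j\alpha}(\alpha_1 p) + (1 - p)\,\sigma_{j\alpha}(\alpha_2 p) + \alpha_1^{j(j-1)/2} p^j (1 - 2\alpha_1^j p + \alpha_1^{2j+1} p^2), \tag{20}$$

with $\sigma_j(0) = 0$.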

Number  of  runs  of  length  greater  than  or  equal  to  j 

Let $T_{n,j}$ stand for the number of runs of length greater than or equal to $j$ of $A_1$ responses, which occur from trial $n$ to the termination of the process. Then $T_{1,j}$ denotes the total number of such runs. Now, for an infinite number of trials

Ti.,  =  T„.,  +  5„.,-,,  (z  =  2,3,  •••). 

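Letting $\lambda_j$ be the expectation of the number of runs of length $\geq j$, a development similar to that previously outlined gives

$$\lambda_{j\beta}(v) = p_1 \lambda_{j\beta}(\beta_1 v) + (1 - p_1)\lambda_{j\beta}(\beta_2 v) + (1 - p_{j+1}) \prod_{i=1}^{j} p_i, \tag{21}$$

$$\lambda_{j\alpha}(p) = p\,\lambda_{j\alpha}(\alpha_1 p) + (1 - p)\,\lambda_{j\alpha}(\alpha_2 p) + (1 - \alpha_1^j p)\,\alpha_1^{j(j-1)/2} p^j. \tag{22}$$

The expectation of the total number of runs of $A_1$ responses in an infinite number of trials is obtained when $j = 1$. Denoting this statistic by $\lambda$,

$$\lambda_\beta(v) = \frac{v}{1 + v}\lambda_\beta(\beta_1 v) + \frac{1}{1 + v}\lambda_\beta(\beta_2 v) + \frac{v}{(1 + v)(1 + \beta_1 v)}, \tag{23}$$

$$\lambda_\alpha(p) = p\,\lambda_\alpha(\alpha_1 p) + (1 - p)\,\lambda_\alpha(\alpha_2 p) + p(1 - \alpha_1 p), \tag{24}$$

with $\lambda(0) = 0$. Additional functional equations for other random variables of interest, such as runs of $A_2$ responses, have been derived in [11].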

General  Functional  Equations  for  the  One-Absorbing-Barrier  Models 

The functional equations presented for statistics of the beta model have the general form

$$f(v, \beta_1, \beta_2) = \frac{v}{1 + v} f(\beta_1 v, \beta_1, \beta_2) + \frac{1}{1 + v} f(\beta_2 v, \beta_1, \beta_2) + g(v, \beta_1, \beta_2) \tag{25}$$

where

$$0 < v < \infty, \qquad 0 < \beta_1 < 1, \qquad 0 < \beta_2 < 1.$$

The term $g(v, \beta_1, \beta_2)$ is, in general, different for each statistic considered. For all except the run statistics

$$g(v, \beta_1, \beta_2) > 0, \qquad g(0, \beta_1, \beta_2) = 0, \qquad \lim_{v \to \infty} g(v, \beta_1, \beta_2) \geq 1. \tag{26}$$

For these statistics

$$f(0, \beta_1, \beta_2) = 0, \qquad \lim_{v \to \infty} f(v, \beta_1, \beta_2) = \infty. \tag{27}$$

Equation (26) does not hold for the run statistics and the boundary conditions for the run statistics have to be defined separately.

The functional equations for statistics of the alpha model are seen to have the general form

$$y(p, \alpha_1, \alpha_2) = p\,y(\alpha_1 p, \alpha_1, \alpha_2) + (1 - p)\,y(\alpha_2 p, \alpha_1, \alpha_2) + z(p, \alpha_1, \alpha_2) \tag{28}$$

where

$$0 < p < 1, \qquad 0 < \alpha_1 < 1, \qquad 0 < \alpha_2 < 1.$$

For the statistics of the alpha model

$$z(0, \alpha_1, \alpha_2) = 0, \qquad z(1, \alpha_1, \alpha_2) > 0, \tag{29}$$

and the boundary conditions for all the statistics considered are

$$y(0, \alpha_1, \alpha_2) = 0 \tag{30}$$

and

$$\lim_{p \to 1} y(p, \alpha_1, \alpha_2) \quad \text{is finite.}$$

The functional equations for the run statistics of the beta model differ in nature from the functional equations for the other statistics considered. A discussion of the functional equations for the run statistics is presented in [11].

The sections which follow present formal solutions to (25) and (28) under the boundary conditions (27) and (30) respectively. Theorems concerning existence, uniqueness and other properties of the solutions have been proved in [11] by methods similar to those of Bellman [3]. Some of these theorems are stated here without proof.

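On the functional equation for the OAB beta model

Writing $f(v, \beta_1, \beta_2)$ simply as $f(v)$, (25) takes the form

(31)    $f(v) = \frac{v}{1 + v}\,f(\beta_1 v) + \frac{1}{1 + v}\,f(\beta_2 v) + g(v),$

where

$g(v) \ge 0, \qquad g(0) = 0, \qquad \lim_{v \to \infty} g(v) \ge 1,$

and

$f(0) = 0; \qquad \lim_{v \to \infty} f(v) = \infty.$

Further, let $0 < \beta_1 < 1$; $0 < \beta_2 < 1$. The cases ($\beta_1 = 1$, $\beta_2 < 1$) and ($\beta_1 < 1$, $\beta_2 = 1$) can be considered separately.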

Existence of solution. For any function $r(v)$ define the operator $T$ by

(32)    $T r(v) = \frac{v}{1 + v}\,r(\beta_1 v) + \frac{1}{1 + v}\,r(\beta_2 v) + g(v).$

Theorem 1.

$f(v) = \lim_{n \to \infty} T^{n} g(v)$

when the limit exists.

Theorem 2. If $g(v)$ is a monotone increasing function of $v$, then a solution $f(v)$ exists if

$\sum_{i=0}^{\infty} g(\beta^{i} v)$

is finite for $0 < v < \infty$, where $0 < \beta = \max(\beta_1, \beta_2) < 1$.

As almost all the $g(v)$ occurring in the beta model first-moment equations are monotone increasing functions of $v$ which satisfy the conditions of Theorem 2, the existence of the mean of most of the random variables introduced for the OAB beta model is assured.
From a proof similar to that for Theorem 2 it follows that when $g(v)$ is a monotone increasing function of $v$,

(33)    $\sum_{i=0}^{\infty} g(\beta_m^{i} v) \le f(v) \le \sum_{i=0}^{\infty} g(\beta_M^{i} v),$

where

$\beta_M = \max(\beta_1, \beta_2)$ and $\beta_m = \min(\beta_1, \beta_2)$.

Continuity. If $|g(v)|$ is bounded in $0 < v < \infty$, the solution $f(v)$ is continuous.

Monotonicity. If $g(v)$ is a monotone increasing function of $v$, and if $\beta_1 > \beta_2$, then $f(v)$ is a monotone increasing function of $v$.

Uniqueness. The solution $f(v)$ is unique in $0 < v < \infty$.
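
The existence result and the bounds for the beta model can be checked numerically. The following is a minimal sketch and not part of the original paper: the function names, the choice g(v) = v/(1 + v), and the parameter values are illustrative assumptions. Because the two event operators commute, the n-th iterate of T only needs values at the points beta1^i beta2^j v, so the recursion stays small.

```python
from functools import lru_cache


def iterate_T(g, v, beta1, beta2, depth):
    """Return (T^depth g)(v) for the operator
    (T r)(v) = v/(1+v) * r(beta1*v) + 1/(1+v) * r(beta2*v) + g(v)."""

    @lru_cache(maxsize=None)
    def r(i, j, k):
        # Value of T^k g at the point x = beta1**i * beta2**j * v.
        x = beta1 ** i * beta2 ** j * v
        if k == 0:
            return g(x)
        return (x / (1 + x)) * r(i + 1, j, k - 1) + r(i, j + 1, k - 1) / (1 + x) + g(x)

    return r(0, 0, depth)


def series_bounds(g, v, beta1, beta2, terms):
    """Partial sums of the lower and upper bounds for monotone increasing g:
    sum g(min(beta)^i v) <= f(v) <= sum g(max(beta)^i v)."""
    lo, hi = min(beta1, beta2), max(beta1, beta2)
    lower = sum(g(lo ** i * v) for i in range(terms))
    upper = sum(g(hi ** i * v) for i in range(terms))
    return lower, upper


def g_example(v):
    # Illustrative assumption: monotone increasing, g(0) = 0, limit 1 as v grows.
    return v / (1 + v)


if __name__ == "__main__":
    approx = iterate_T(g_example, 2.0, 0.8, 0.5, 120)
    print(approx, series_bounds(g_example, 2.0, 0.8, 0.5, 121))
```

With depth n the iterate contains the terms for trials 0 through n, so it should be compared with bounds that use n + 1 terms.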

On the functional equation for the OAB alpha models
For the functional equation

(34)    $y(p) = p\,y(\alpha_1 p) + (1 - p)\,y(\alpha_2 p) + z(p)$

the development of existence, uniqueness and other properties of the solution
is similar to that for (31). Some properties of $y(p)$ are stated without proof.

Existence. For any function $Q(p)$, define the operator

(35)    $A\,Q(p) = p\,Q(\alpha_1 p) + (1 - p)\,Q(\alpha_2 p) + z(p)$

and let

$\lim_{n \to \infty} A^{(n)} z(p) \,\big|_{p=1} = c(\alpha_1, \alpha_2).$

Theorem 3.

$y(p) = \lim_{n \to \infty} A^{(n)} z(p).$

Theorem 4. If $z(p)$ is monotone increasing in $p$, then

$\sum_{i=0}^{\infty} z(\alpha_m^{i} p) \le y(p) \le \sum_{i=0}^{\infty} z(\alpha_M^{i} p)$

where

$\alpha_M = \max(\alpha_1, \alpha_2), \qquad \alpha_m = \min(\alpha_1, \alpha_2).$

Monotonicity. If $z(p)$ is monotone increasing in $p$, and $\alpha_1 > \alpha_2$, then $y(p)$ is monotone increasing in $p$.

Convexity. If $z(p)$ is convex and $\alpha_1 > \alpha_2$, then $y(p)$ is convex.

Solution  of  the  Functional  Equation  for  the  OAB  Beta  Model 

The  solution  to  (31)  is  obtained  by  generalizing  from  solutions  of  the 
equation  for  special  parameter  values.  The  parameter  space  of  the  OAB 
beta  model  is  shown  in  Fig.  2.  One  solution  for  special  parameter  values  is 
derived  here.  A  detailed  presentation  will  be  found  in  [11]. 

Theorem 5. Along sides (1) and (2) of Fig. 2,

(36)    $f(v) = \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} g(\beta_1^{m} \beta_2^{n} v) \prod_{i=0}^{m-1} \frac{\beta_2^{n} \beta_1^{i} v}{1 + \beta_2^{n} \beta_1^{i} v}.$

Figure 2
Parameter Space of OAB Beta Model

Proof. Along side (1), $\beta_2 = 0$, $\beta_1 < 1$, and only the $n = 0$ term of the summation over $n$ is nonzero. The resulting expression is the one obtained from the functional equation, for in this case

$f(v) = \frac{v}{1 + v}\,f(\beta_1 v) + g(v),$

giving

$f(\beta_1^{m} v) = \frac{\beta_1^{m} v}{1 + \beta_1^{m} v}\,f(\beta_1^{m+1} v) + g(\beta_1^{m} v),$

for $m = 0, 1, \cdots$, from which the desired result is obtained by successive substitution. Along side (2), $\beta_1 = 1$, $\beta_2 < 1$, and (36) becomes

$\sum_{n=0}^{\infty} g(\beta_2^{n} v) \sum_{m=0}^{\infty} \prod_{i=0}^{m-1} \frac{\beta_2^{n} v}{1 + \beta_2^{n} v} = \sum_{n=0}^{\infty} (1 + \beta_2^{n} v)\,g(\beta_2^{n} v),$

the last expression being also the one obtained from the functional equation for the case $\beta_1 = 1$, $\beta_2 < 1$, for which case the functional equation reduces to $f(v) - f(\beta_2 v) = (1 + v)\,g(v)$. Q.E.D.

Note  that  at  the  point  (1,  1)  of  the  parameter  space  the  solutions  diverge. 
By  letting  /81  =  /Sg  ,  jSg  =  /S*  (fc  =  1,2,  •  •  •),  solutions  along  arcs  of  the  form 
(6)  and  (7)  of  the  parameter  space  can  be  obtained.  The  resulting  functional 
equations may be written in the form of q-difference equations for which
there  exists  an  extensive  body  of  literature  [1,  2]. 

Examination of the solutions for various special parameter values suggests
the form of the general solution. The general solution to (31) is given
by the following theorem.

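Theorem 6.

$f(v) = \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A_{m,n}(v)\,g(\beta_1^{m} \beta_2^{n} v),$

where

$A_{0,0}(v) = 1,$

$A_{m,0}(v) = \prod_{i=1}^{m} \frac{\beta_1^{i-1} v}{1 + \beta_1^{i-1} v} \qquad (m = 1, 2, \cdots),$

$A_{0,n}(v) = \prod_{j=1}^{n} \frac{1}{1 + \beta_2^{j-1} v} \qquad (n = 1, 2, \cdots),$

$A_{m,n}(v) = \sum_{k=0}^{n} A_{0,k}(v)\,A_{1,0}(\beta_2^{k} v)\,A_{m-1,n-k}(\beta_1 \beta_2^{k} v) \qquad (m, n = 1, 2, \cdots).$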

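Proof. Substitution in (31) gives

$\sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A_{m,n}(v)\,g(\beta_1^{m} \beta_2^{n} v) = \frac{v}{1 + v} \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A_{m,n}(\beta_1 v)\,g(\beta_1^{m+1} \beta_2^{n} v) + \frac{1}{1 + v} \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} A_{m,n}(\beta_2 v)\,g(\beta_1^{m} \beta_2^{n+1} v) + g(v)$

so that

(37)    $\sum_{m=0}^{\infty} A_{m,0}(v)\,g(\beta_1^{m} v) = g(v) + \frac{v}{1 + v} \sum_{m=1}^{\infty} A_{m-1,0}(\beta_1 v)\,g(\beta_1^{m} v),$

which gives

$A_{0,0}(v) = 1; \qquad A_{m,0}(v) = \frac{v}{1 + v}\,A_{m-1,0}(\beta_1 v) = \prod_{i=1}^{m} \frac{\beta_1^{i-1} v}{1 + \beta_1^{i-1} v};$

(38)    $\sum_{n=1}^{\infty} A_{0,n}(v)\,g(\beta_2^{n} v) = \frac{1}{1 + v} \sum_{n=1}^{\infty} A_{0,n-1}(\beta_2 v)\,g(\beta_2^{n} v),$

which gives

$A_{0,n}(v) = \prod_{j=1}^{n} \frac{1}{1 + \beta_2^{j-1} v};$

and

(39)    $\sum_{m=1}^{\infty} \sum_{n=1}^{\infty} A_{m,n}(v)\,g(\beta_1^{m} \beta_2^{n} v) = \frac{v}{1 + v} \sum_{m=1}^{\infty} \sum_{n=1}^{\infty} A_{m-1,n}(\beta_1 v)\,g(\beta_1^{m} \beta_2^{n} v) + \frac{1}{1 + v} \sum_{m=1}^{\infty} \sum_{n=1}^{\infty} A_{m,n-1}(\beta_2 v)\,g(\beta_1^{m} \beta_2^{n} v).$

The coefficients in this last expression satisfy the difference equation

$A_{m,n}(v) = \frac{v}{1 + v}\,A_{m-1,n}(\beta_1 v) + \frac{1}{1 + v}\,A_{m,n-1}(\beta_2 v),$

from which follows [11]

$A_{m,n}(v) = \sum_{k=0}^{n} A_{0,k}(v)\,A_{1,0}(\beta_2^{k} v)\,A_{m-1,n-k}(\beta_1 \beta_2^{k} v) = \sum_{k=0}^{n} \frac{\beta_2^{k} v}{1 + \beta_2^{k} v} \prod_{j=1}^{k} \frac{1}{1 + \beta_2^{j-1} v}\; A_{m-1,n-k}(\beta_1 \beta_2^{k} v).$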

General  Solution  to  the  OAB  Alpha  Model  Functional  Equation 

Replacing $\beta_2$ and $\beta_1$ by $\alpha_2$ and $\alpha_1$ in Fig. 2 gives the parameter space of the OAB alpha model. The general solution for (34) can be derived [11] in a manner similar to that used for the beta model functional equation. The solution is given by

Theorem 7.

$y(p) = \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} b_{m,n}(p)\,z(\alpha_1^{m} \alpha_2^{n} p),$

where

$b_{0,0}(p) = 1,$

$b_{m,0}(p) = p^{m} \alpha_1^{m(m-1)/2} \qquad (m = 1, 2, \cdots),$

$b_{0,n}(p) = \prod_{j=1}^{n} (1 - \alpha_2^{j-1} p) \qquad (n = 1, 2, \cdots),$

$b_{m,n}(p) = \sum_{k=0}^{n} b_{1,0}(\alpha_2^{k} p)\,b_{0,k}(p)\,b_{m-1,n-k}(\alpha_1 \alpha_2^{k} p) \qquad (m, n = 1, 2, \cdots).$

Proof. The proof is similar to that used for the beta model equation.
Details are given in [11].
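
The coefficient recursion for the alpha model lends itself to a direct numerical check against the functional equation. The sketch below is illustrative only: the function name `alpha_series`, the truncation at m + n <= order, and the test function z are assumptions, not part of the original derivation. It reads b_{m,n}(p) as the total probability of all response sequences with m applications of alpha1 and n applications of alpha2, split according to the number k of alpha2 steps that precede the first alpha1 step.

```python
from functools import lru_cache


def alpha_series(z, p0, a1, a2, order):
    """Partial sum, over all m + n <= order, of b_{m,n}(p0) * z(a1**m * a2**n * p0)."""

    @lru_cache(maxsize=None)
    def b(m, n, i, j):
        # b_{m,n} evaluated at the point p = a1**i * a2**j * p0.
        p = a1 ** i * a2 ** j * p0
        if m == 0:
            # n successive alpha2 steps: product of (1 - a2**t * p), t = 0..n-1.
            prod = 1.0
            for t in range(n):
                prod *= 1.0 - a2 ** t * p
            return prod
        if n == 0:
            # m successive alpha1 steps: p * (a1 p) * ... * (a1**(m-1) p).
            return p ** m * a1 ** (m * (m - 1) // 2)
        return sum(
            b(0, k, i, j) * b(1, 0, i, j + k) * b(m - 1, n - k, i + 1, j + k)
            for k in range(n + 1)
        )

    total = 0.0
    for m in range(order + 1):
        for n in range(order + 1 - m):
            total += b(m, n, 0, 0) * z(a1 ** m * a2 ** n * p0)
    return total


if __name__ == "__main__":
    print(alpha_series(lambda p: p, 0.7, 0.9, 0.6, 12))
```

Under this reading, the partial sum of order N at p equals z(p) plus p times the partial sum of order N - 1 at alpha1 p plus (1 - p) times the partial sum of order N - 1 at alpha2 p, which is the functional equation truncated at N trials.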

Discussion 

Analytical  techniques  applicable  to  a  class  of  learning  models  have 
been  presented.  Functional  equations  for  various  statistics  of  two  learning 
models,  viz.,  Luce's  nonlinear  beta  model  and  a  linear  commuting-operator 
model  called  the  alpha  model,  have  been  derived  from  the  branching  processes 
defined  by  the  models. 

The results on stochastic properties of Luce's beta model are new. For
the alpha model, power series solutions to the functional equations for the
first and second moments of the total number of $A_1$ responses and the trial
number at which the last $A_1$ occurs had been obtained by Tatsuoka and
Mosteller [15]. However, the technique of expanding the functions in a
power series in the variable often fails, as is illustrated by the fact that the
power series solutions (obtained by Tatsuoka [14]) to the functional equations
for the first and second moments of the total number of $A_1$ responses for
the one-absorbing-barrier (OAB) beta model are not valid for $v > 1$.

By  investigating  two  general  equations,  the  problem  of  solving  the 
individual  functional  equations  for  the  OAB  models  was  simplified.  The 
functional  equations  for  the  sequential  statistics  of  the  OAB  beta  model 
do  not  have  the  same  boundary  conditions  as  the  general  equation  presented 
in  this  paper,  and  their  solutions  require  additional  investigation. 

Because  of  the  complexity  of  the  expressions  obtained  for  the  statistics 
of  the  OAB  models,  an  attempt  was  made  to  find  some  close  bounds  which 
could  be  easily  computed.  Some  upper  and  lower  bounds  for  statistics  of 
the  OAB  alpha  model  have  been  presented  in  ([11],  ch.  5).  An  upper  bound 
for  one  statistic  of  the  OAB  beta  model  has  also  been  derived  in  [11],  mainly 
to  illustrate  the  methods  used  to  obtain  upper  bounds  for  a  few  statistics 
of  the  OAB  beta  model.  These  methods  failed  for  a  number  of  the  statistics. 
Furthermore,  a  method  for  the  derivation  of  close  lower  bounds  for  the 
OAB  beta  model  remains  to  be  found. 

Empirical tests and comparisons of the beta model with other models
have been presented by Bush, Galanter, and Luce [6] and Fey [10]. The
use of statistics such as those derived in this paper for the estimation of
parameters and for measuring the goodness of fit has been discussed by
Bush and Mosteller [8], Bush, Galanter, and Luce [6] and by others (see [5]).

REFERENCES 

[1] Adams, C. R. On the linear ordinary q-difference equation. Ann. Math., 2nd ser., 30, 1929, 195-205.
[2] Adams, C. R. Linear q-difference equations. Bull. Amer. math. Soc., 2nd ser., 37, 1931, 361-400.
[3] Bellman, R. On a certain class of functional equations. In T. E. Harris, R. Bellman, and H. N. Shapiro (Eds.), Functional equations occurring in decision processes. Research memo. RM-898, Rand Corp., Santa Monica, Calif., 1952.
[4] Bush, R. R. Some properties of Luce's beta model for learning. In K. J. Arrow, S. Karlin, and P. Suppes (Eds.), Proceedings of the First Stanford Symposium on mathematical methods in the social sciences. Stanford, Calif.: Stanford Univ. Press, 1960.
[5] Bush, R. R. and Estes, W. K. (Eds.) Studies in mathematical learning theory. Stanford, Calif.: Stanford Univ. Press, 1959.
[6] Bush, R. R., Galanter, E., and Luce, R. D. Tests of the beta model. In R. R. Bush and W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford, Calif.: Stanford Univ. Press, 1959. Ch. 18.
[7] Bush, R. R. and Mosteller, F. A comparison of eight models. In R. R. Bush and W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford, Calif.: Stanford Univ. Press, 1959. Ch. 15.
[8] Bush, R. R. and Mosteller, F. Stochastic models for learning. New York: Wiley, 1955.
[9] Bush, R. R. and Sternberg, S. A single-operator model. In R. R. Bush and W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford, Calif.: Stanford Univ. Press, 1959. Ch. 10.
[10] Fey, C. Investigation of some mathematical models for learning. J. exp. Psychol., 61, 1961, 455-461.
[11] Kanal, L. Analysis of some stochastic process arising from a learning model. Unpublished doctoral thesis, Univ. Penn., 1960.
[12] Lamperti, J. and Suppes, P. Some asymptotic properties of Luce's beta learning model. Psychometrika, 25, 1960, 233-241.
[13] Luce, R. D. Individual choice behavior. New York: Wiley, 1959.
[14] Tatsuoka, M. Asymptotic mean and variance of number of errors for the beta model. Unpublished memo. PC-11, 1958.
[15] Tatsuoka, M. and Mosteller, F. A commuting-operator model. In R. R. Bush and W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford, Calif.: Stanford Univ. Press, 1959. Ch. 12.

Manuscript  received  13/3/60 

Revised  manuscript  received  7/23/61 


THE  ASYMPTOTIC  DISTRIBUTION  FOR  THE  TWO-ABSORBING- 
BARRIER  BETA  MODEL* 

LAVEEN KANAL†

GENERAL DYNAMICS/ELECTRONICS
ROCHESTER,  NEW  YORK 

For the two-absorbing-barrier specialization of Luce's beta learning model, the asymptotic distribution of the response probability has all its density at $p = 0$ and $p = 1$. The functional equation for the amount of the density at $p = 1$ is investigated in this paper.

Luce's beta learning model [5] for the two-response, two-event, contingent case is given by the transition equations

(1)    $v_{n+1} = \begin{cases} \beta_1 v_n & \text{with probability } p_n, \\ \beta_2 v_n & \text{with probability } 1 - p_n, \end{cases} \qquad 0 < v < \infty, \quad \beta_i > 0, \quad i = 1, 2,$

where $p_n$ and $1 - p_n$ are respectively the probabilities of response $A_1$ and response $A_2$, and where $v_n = p_n/(1 - p_n)$. In a companion paper [3] statistics for the one-absorbing-barrier (OAB) beta model, obtained when $\beta_1 < 1$, $\beta_2 < 1$, are derived. In this paper a statistic for the two-absorbing-barrier (TAB) beta model arising when $\beta_1 > 1$, $\beta_2 < 1$ is presented. Some statistics for the two-reflecting-barrier beta model are considered in [4].

For the two-absorbing-barrier beta model the asymptotic distribution of $p_n$ has all its density at $p = 0$ and $p = 1$. The amount of the density at $p = 1$ is a useful statistic for these models. If $f(v)$ is the probability that a "particle" starting at $v$ is eventually absorbed at $+\infty$, i.e., at $p = 1$, the functional equation for $f(v)$ is

(2)    $f(v) = \frac{v}{1 + v}\,f(\beta_1 v) + \frac{1}{1 + v}\,f(\beta_2 v),$

where

$0 < v < \infty, \qquad \beta_1 > 1, \qquad \beta_2 < 1, \qquad f(0) = 0, \qquad \lim_{v \to \infty} f(v) = 1.$

*Abstracted  from  a  portion  of  the  author's  doctoral  dissertation,  University  of 
Pennsylvania,  June  1960.  The  author  is  indebted  to  Prof.  B.  Epstein  and  to  Prof.  Robert  R. 
Bush,  his  dissertation  supervisor,  for  the  valuable  help  and  encouragement  received 
from  them. 

†Formerly at the Moore School of Electrical Engineering, University of Pennsylvania, Philadelphia, Pa. The author is grateful to the Moore School for the support extended to him during his doctoral studies. He also wishes to thank D. Parkhill and N. Finkelstein of General Dynamics for their encouragement of his work.

This article appeared in Psychometrika, 1962, 27, 105-109. Reprinted with permission.

The solution of (2) is the subject of this paper. Existence, uniqueness, and
monotonicity of the solution are shown in [4] by methods similar to those
of Bellman [1].

Solution  for  the  symmetric  model 

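For the two-absorbing-barrier symmetric beta model

$\beta_2 = \frac{1}{\beta_1} < 1.$

Let $x = \log_e v$ and $b = \log_e \beta_1$. Then (2) becomes

(3)    $f(x) = \frac{e^{x}}{1 + e^{x}}\,f(x + b) + \frac{1}{1 + e^{x}}\,f(x - b).$

The solution of (3) is given by Theorem 1.

Theorem 1*.

$f(x) = \frac{\sum_{k=0}^{\infty} \exp\left\{-\frac{1}{2b}\left[x - (k + \tfrac{1}{2})b\right]^{2}\right\}}{\sum_{k=-\infty}^{\infty} \exp\left\{-\frac{1}{2b}\left[x - (k + \tfrac{1}{2})b\right]^{2}\right\}}.$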

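Proof. From (3), letting $g(x) = f(x) - f(x - b)$, $h(x) = \log_e g(x)$, one gets $h(x) - h(x + b) = x$. Assuming $h(x) = c_0 + c_1 x + c_2 x^{2}$, and substituting gives

$g(x) = p(x) \exp\left[-\frac{1}{2b}\left(x - b/2\right)^{2}\right],$

where $p(x)$ is a periodic function of period $b$. As

$f(x + b) - f(x) = g(x + b), \qquad f(x) = f(x + nb) - \sum_{k=1}^{n} g(x + kb).$

Then as $n \to \infty$, $f(x + nb) \to 1$ and

$f(x) = 1 - p(x) \sum_{k=1}^{\infty} \exp\left\{-\frac{1}{2b}\left[x + (k - \tfrac{1}{2})b\right]^{2}\right\}.$

Furthermore

$f(x + nb) = 1 - p(x) \sum_{k=n+1}^{\infty} \exp\left\{-\frac{1}{2b}\left[x + (k - \tfrac{1}{2})b\right]^{2}\right\},$

and letting $n \to -\infty$ gives

$p(x) = \frac{1}{\sum_{k=-\infty}^{\infty} \exp\left\{-\frac{1}{2b}\left[x + (k - \tfrac{1}{2})b\right]^{2}\right\}}$

so that

$f(x) = 1 - \frac{\sum_{k=1}^{\infty} \exp\left\{-\frac{1}{2b}\left[x + (k - \tfrac{1}{2})b\right]^{2}\right\}}{\sum_{k=-\infty}^{\infty} \exp\left\{-\frac{1}{2b}\left[x + (k - \tfrac{1}{2})b\right]^{2}\right\}}$

from which Theorem 1 follows. Note that $f(0) = \tfrac{1}{2}$ as the symmetry of the problem indicates. Q.E.D.

*Prof. B. Epstein pointed out the error in taking limits in an earlier version of Theorem 1 presented by Bush [2].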

Corollary 1. When $x \to -\infty$, i.e., for large negative values of $x$,

$f(x) = p(x) \exp\left[-\frac{1}{2b}\left(x - b/2\right)^{2}\right],$

as $p(x)$ is of period $b$ and the term corresponding to $k = 0$ dominates in the numerator.

Corollary 2. For large positive $x$,

$f(x) = 1 - p(x) \exp\left[-\frac{1}{2b}\left(x + b/2\right)^{2}\right].$

Corollary 3. When $b < 4$ the denominator of Theorem 1 is given by

$\frac{1}{p(x)} \approx \sqrt{\frac{2\pi}{b}},$

obtained by performing a Fourier series analysis.

Corollary 4. When $b < 4$,

$f(x) \approx \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x/\sqrt{b}} e^{-t^{2}/2}\,dt,$

for then by Corollary 3, the denominator of Theorem 1 is closely approximated by a constant and the numerator may be approximated by replacing the sum from zero to infinity by an integral from $-1/2$ to infinity. Using the transformation

$t = \frac{1}{\sqrt{b}}\left[x - (k + \tfrac{1}{2})b\right]$

gives the corollary.
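
The closed form for the symmetric model can be checked against the functional equation numerically. The following is a minimal sketch under stated assumptions: it evaluates f(x) as the ratio of Gaussian sums, with terms exp(-(x - (k + 1/2) b)^2 / (2b)) summed over k >= 0 in the numerator and over all k in the denominator. The function names and the truncation window `tail` are illustrative choices, not part of the paper.

```python
import math


def tab_symmetric(x, b, tail=40.0):
    """Absorption probability f(x) for the symmetric model, x = log v, b = log beta1 > 0.

    Terms farther than about tail * sqrt(b) from x are dropped; they are
    far below double-precision resolution.
    """
    centre = math.floor(x / b)
    half_width = math.ceil(tail / math.sqrt(b)) + 1
    num = 0.0
    den = 0.0
    for k in range(centre - half_width, centre + half_width + 1):
        term = math.exp(-((x - (k + 0.5) * b) ** 2) / (2.0 * b))
        den += term          # denominator: all k
        if k >= 0:
            num += term      # numerator: k >= 0 only
    return num / den


def equation_residual(x, b):
    """f(x) minus the right-hand side of the symmetric functional equation."""
    w = math.exp(x) / (1.0 + math.exp(x))
    rhs = w * tab_symmetric(x + b, b) + (1.0 - w) * tab_symmetric(x - b, b)
    return tab_symmetric(x, b) - rhs


def normal_cdf(u):
    """Standard normal distribution function, used for the small-b approximation."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


if __name__ == "__main__":
    for x in (-2.0, 0.0, 1.0):
        print(x, tab_symmetric(x, 1.0), normal_cdf(x / math.sqrt(1.0)))
```

The residual should vanish to rounding error for any b > 0, whereas agreement with the normal distribution function is only approximate and degrades as b grows.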

Solution  for  the  general  TAB  beta  model 

For the general case $\beta_1 > 1$, $\beta_2 < 1$, it is convenient to obtain the solution
in terms of the solution for the symmetric model. Let the solution
for the symmetric model given in Theorem 1 be denoted by $R(v)$. Then the
solution for the general model is given by Theorem 2.

Theorem 2. For $\beta_1 > 1$, $\beta_2 < 1$

^\Pl)     n  =  0    m=-0     it  =  0 

where


5„(/3,) 


n  (1  -  /32) 


(n+l)/2 


n  (1  -  »v) 

i  =  l 

Proof. Define the transform

$F(s) = \int_{0}^{\infty} f(v)\,v^{-s-1}\,dv.$

Writing (2) in the form

$\left(1 + \frac{1}{v}\right) f(v) = f(\beta_1 v) + \frac{1}{v}\,f(\beta_2 v),$

and applying the transform gives

$F(s) + F(s + 1) = \beta_1^{s} F(s) + \beta_2^{s+1} F(s + 1).$
If  R{s)  is  the  transform  of  R{v),  it  is  shown  in  [4]  that 

from  which,  by  expanding  the  numerator  and  denominator  terms  in  the 
product,  one  gets 

^\Pl)     n  =  0  A  =  0  m  =  0 

The  inverse  transform  of  ^^'''^rRi.s)  being  ^(/3i~*/3"i;),  taking  the  inverse 
transform  of  F(s)  gives  Theorem  2.  Q.E.D. 

It  is  noted  that  the  coefficients  in  the  series  of  Theorem  2  tend  to  zero 
rather  rapidly. 
SOME RANDOM WALKS ARISING IN
LEARNING MODELS I

Samuel  Karlin 


Introduction 

The present paper presents an analysis of certain transition operators arising in some learning models introduced by Bush and Mosteller [2]. They suppose that the organism makes a sequence of responses among a fixed finite set of alternatives and there is a probability $p_s^{n}$ at moment $n$ that response $s$ will occur. They suppose further that the probabilities $p_s^{n+1}$ are determined by the $p_s^{n}$, the response $s_n$ made after moment $n$, and the outcome or event $r_n$ that follows response $s_n$. We shall examine in detail the one-dimensional models which occur in their theory. These models can be described in simplest form as follows: There exist two alternatives $A_1$ and $A_2$, and two possible outcomes $r_1$ and $r_2$, for each experiment. There exists a set of Markoff matrices $F_{ij}$ which will apply where choice $i$ was made and outcome $r_j$ occurs. Let $p$ represent the initial probability of choosing alternative $A_1$, and $1 - p$ the probability of choosing $A_2$. Depending on the choice and outcome, the vector $(p, 1 - p)$ is transformed by the appropriate $F_{ij}$ into a new probability vector which represents the new probabilities of preference of $A_1$ and $A_2$, respectively, by the organism. The psychologist is interested in knowing the limiting form of the probability choice vector $(p, 1 - p)$.

The mathematical description of the simplest process of this type can be formulated as follows: A particle on the unit interval executes a random walk subject to two impulses. If it is located at the point $x$, then $x \to F_1 x = \sigma x$ with probability $1 - \phi(x)$, and $x \to F_2 x = 1 - \alpha + \alpha x$ with probability $\phi(x)$. The actual limiting behavior of $x$ depends on the nature of $\phi(x)$. The transition operator representing the change of the distribution describing the position of the particle is given by

$(TF)(x) = \int_{0}^{x/\sigma} [1 - \phi(t)]\,dF(t) + \int_{0}^{(x - 1 + \alpha)/\alpha} \phi(t)\,dF(t).$

We introduce an additional operator, acting on continuous functions, and given by

$U\pi(t) = [1 - \phi(t)]\,\pi(\sigma t) + \phi(t)\,\pi(1 - \alpha + \alpha t).$

It turns out that $T$ is conjugate to $U$; hence knowing the behavior of $U$ one obtains much information about $T$. This interplay shall be exploited considerably. The operator $T$ is not weakly completely continuous nor does it possess any kind of compactness property; thus none of the classical ergodic theorems apply to this type [3]. The limiting behavior of $T^{n}F$ depends very sensitively on the assumptions made about the operators $F_i$ and the probabilities $\phi(x)$.

This  article  is  from  Pacific  J.  Math.,  1953,  3,  725-756.  Reprinted  with  permission. 
Section 1 treats the case where $\phi(x) = x$. This causes the boundaries 0 and 1 to be absorbing states, and thus the limiting distribution concentrates only at these points. However, the concentration depends on the initial distribution. By examining the corresponding $U$ in detail, we have been able to obtain much additional knowledge. For example, we have shown that if $\pi$ is $m$ times continuously differentiable then $(U^{n}\pi)^{(r)}$ converges uniformly for each $0 \le r \le m - 1$. It is worth emphasizing that the knowledge of the convergence of the distributions does not imply the uniform convergence of $U^{n}\pi$ for any continuous function $\pi$. Additional arguments are needed for this conclusion. In this connection, we finally remark that R. Bellman, T. Harris, and H. N. Shapiro [1] have analyzed only this case independently. They did not point out the connection between the operators $T$ and $U$. The methods they used to establish the convergence of $T^{n}F$ are probabilistic. Our paper in § 1 overlaps with theirs in some of the theorems, notably 6, 8, 9, 12, and 15; our results subsume theirs, and their proofs are entirely different from ours. Section 2 considers the case where $\phi(x)$ is monotone increasing and

\</>{x)  -cp{y)\   <«  <  1. 

This leads to the ergodic phenomenon, or steady-state situation, where the limiting distributions are independent of the starting distributions.

In § 3, we examine the situation $\phi(x) = 1 - x$. This corresponds to completely reflecting boundaries, and of course the ergodic phenomenon holds. Other interesting properties of the operators are also developed. We consider in § 4 the case where $\phi(x)$ is linear and monotonic decreasing. Section 5 introduces a further possibility where we allow the particle to stand still with certain probability. This type has been statistically examined by M. M. Flood [5]. In § 6 we investigate the general ergodic type where $\phi(x)$ is not necessarily linear. The arguments here combine both abstract analysis and probabilistic reasoning involving recurrent event theory. Furthermore, it is worth emphasizing, the proofs given in § 6 apply without any modifications to the case where we allow any finite number of impulses acting on the particle. In a future paper we shall present the extension of this model to the circumstance where changes in time occur continuously and the possible motion of the particle has a continuous or infinite discrete range of values.

The last section studies some of the properties of the limiting distribution in the ergodic types. It is shown in all circumstances that the limiting distribution is either singular or absolutely continuous, and the actual form depends on the value of $\sigma + \alpha$.
Most of the analysis carries over to higher dimensional models where more alternatives are allowed. In a subsequent paper we shall present this theory with other generalizations. We finally note that this paper represents a combination of abstract analysis and probability; it is hoped that the methods used will be useful for future investigations of this type.

It has been brought to my attention by the referee that the material of [6], [7], [8], and [9] relates closely to the content of this paper. Their techniques seem to be different.

1. A particle undergoes a random walk on the unit interval subject to the following law: If the particle is at $x$, then after unit time $x \to \alpha + (1 - \alpha)x$ with probability $x$, and $x \to \sigma x$ with probability $1 - x$, where $0 < \alpha, \sigma < 1$. If $F(x)$ represents the cumulative distribution describing the location of $x$ at the beginning of the time interval with the understanding that $F(x) = 1$ for $x > 1$ and $F(x) = 0$ for $x < 0$, then the new distribution locating the position of the particle at the end of the time interval is given by

$G(x) = TF = \int_{0}^{x/\sigma} (1 - t)\,dF(t) + \int_{0}^{(x - \alpha)/(1 - \alpha)} t\,dF(t).$    (1)

Indeed, the probability $dG(x)$ that after unit time the particle is located at $x$ can materialize in two ways; namely, the particle was at $x/\sigma$ and moved with probability $1 - x/\sigma$ to $x$, or it jumped with probability $(x - \alpha)/(1 - \alpha)$ from $(x - \alpha)/(1 - \alpha)$ to $x$ during the unit time interval. This yields

$dG(x) = \left(1 - \frac{x}{\sigma}\right) dF\!\left(\frac{x}{\sigma}\right) + \frac{x - \alpha}{1 - \alpha}\,dF\!\left(\frac{x - \alpha}{1 - \alpha}\right),$

which easily implies the conclusion of equation (1).

Equation (1) represents the transition law for the particular Markoff process on hand.

The transformation $T$ is easily seen to furnish a linear bounded mapping of the space of functions of bounded variation $(V)$ on the unit interval into itself. Furthermore, $T$ takes distributions into distributions and is of norm 1. This section investigates the behavior of $T^{n}$ for large $n$ with the aim of determining limiting properties of $T^{n}$.

We consider the following additional mapping $U$ applied to the space of continuous functions defined on the unit interval $C(0, 1)$:

$(U\pi)(t) = (1 - t)\,\pi(\sigma t) + t\,\pi[\alpha + (1 - \alpha)t].$    (2)

The operator $U$ has a probabilistic interpretation which we shall speak about later; but its direct relevance to $T$ is given in Theorem 1. The inner-product notation

$(\pi, F) = \int_{0}^{1} \pi(t)\,dF(t)$

will be extensively used.

Theorem 1. The conjugate map $U^{*}$ to $U$ is $T$.

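Proof. It is necessary to verify that $(U\pi, F) = (\pi, TF)$ for any continuous function $\pi(t)$ and any distribution $F(t)$ with $F(t) = 1$ for $t > 1$ and $F(t) = 0$ for $t < 0$. Indeed,

$(U\pi, F) = \int_{0}^{1} (1 - t)\,\pi(\sigma t)\,dF(t) + \int_{0}^{1} t\,\pi[\alpha + (1 - \alpha)t]\,dF(t).$

By a change of variable, we get

$(U\pi, F) = \int_{0}^{\sigma} \pi(t)\left(1 - \frac{t}{\sigma}\right) dF\!\left(\frac{t}{\sigma}\right) + \int_{\alpha}^{1} \pi(t)\,\frac{t - \alpha}{1 - \alpha}\,dF\!\left(\frac{t - \alpha}{1 - \alpha}\right) = \int_{0}^{1} \pi(t)\,dG(t)$    where    $G = TF$.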

The value of Theorem 1 is that, by studying the iterates of $U^{n}$, we deduce corresponding results about the conjugate operators $T^{n}$. We proceed now to study this operator $U$. To be complete, we should denote the operator by $U_{\sigma,\alpha}$, but where no ambiguity arises we shall drop the subscripts. Let $W$ denote the isometry

$W\pi(t) = \pi(1 - t).$

Clearly $W^{-1} = W$. We now observe the identity

$U_{1-\alpha,\,1-\sigma} = W U_{\sigma,\alpha} W.$    (3)

The mapping $(\sigma, \alpha) \to (1 - \alpha, 1 - \sigma)$ of the parameter space into itself has the effect of mapping the triangle of the unit square bounded above by $1 - \alpha - \sigma = 0$ into the other triangle located in the unit square. This isomorphism property enables us to restrict our attention to the case where $1 - \alpha - \sigma \ge 0$. Corresponding theorems valid for the other circumstances, where $1 - \alpha - \sigma < 0$, are deduced easily by virtue of (3) and will be summarized at the end of this section. From now on in § 1, unless explicitly stated otherwise, we shall assume that $1 - \alpha - \sigma \ge 0$.

The  next  two  theorems,  which  we  state  for  completeness,  are  immediate  from 
(2). 

Theorem 2. The operator $U$ preserves the values at 0 and 1.

Theorem 3. The operator $U$ is positive; that is, it transforms positive continuous functions into positive continuous functions.

In particular, if $\pi_1(t) \ge \pi_2(t)$ for all $t$, then $U\pi_1 \ge U\pi_2$.

Theorem 4. If $\pi, \pi', \ldots, \pi^{(n)} \ge 0$, then $U\pi, (U\pi)', \ldots, (U\pi)^{(n)} \ge 0$.

Proof. A simple calculation yields

$(U\pi)^{(n)}(t) = (1 - t)\,\sigma^{n} \pi^{(n)}(\sigma t) + t\,(1 - \alpha)^{n} \pi^{(n)}(\alpha + (1 - \alpha)t) + n(1 - \alpha)^{n-1} \pi^{(n-1)}(\alpha + (1 - \alpha)t) - n\sigma^{n-1} \pi^{(n-1)}(\sigma t).$    (4)

Since

$\sigma t \le t \le \alpha + (1 - \alpha)t,$

we conclude since $\pi^{(n-1)}(t)$ is monotonic increasing that

$\pi^{(n-1)}(\alpha + (1 - \alpha)t) \ge \pi^{(n-1)}(\sigma t) \ge 0.$

The assumption that $1 - \alpha \ge \sigma$ implies that $(1 - \alpha)^{n-1} \ge \sigma^{n-1}$. As $\pi^{(n)}(t) \ge 0$, it follows that $(U\pi)^{(n)} \ge 0$. The same conclusion and argument apply to $(U\pi)^{(i)}$ for $0 \le i \le n - 1$.

In  particular,  U  transforms  positive  monotonic  convex  functions  into  functions 
of  the  same  kind.  Although  in  the  proof  of  Theorem  4  we  assumed  the  existence  of 
derivatives,  the  argument  can  be  carried  through  routinely  at  the  expense  of  elegance, 
by  use  of  the  general  definitions  of  convexity  and  monotonicity. 

Theorem 5. If $c \ge \pi^{(i)}(t) \ge 0$ for $0 \le i \le n$, then $(U^{k}\pi)^{(i)}(1) \le K_i$ for $0 \le i \le n$ and hence $(U^{k}\pi)^{(i)}(t) \le K_i$.

Proof. The proof is by induction. By Theorem 2, the theorem is trivially true for $i = 0$. Suppose we have established the result for the $i$th derivative with $0 \le i \le n - 1$. Equation (4) yields

$(U\pi)^{(n)}(1) - \pi^{(n)}(1) = c_1(\alpha)\,\pi^{(n-1)}(1) - c_2(\sigma)\,\pi^{(n-1)}(\sigma) + [(1 - \alpha)^{n} - 1]\,\pi^{(n)}(1),$    (5)

where $c_1(\alpha)$ and $c_2(\sigma)$ are constants depending only on $\alpha$ and $\sigma$ respectively, and on $n$. If

$\pi^{(n)}(1) \ge M(\alpha, \sigma, c),$

where $M$ is a constant sufficiently large, then (5) yields

$(U\pi)^{(n)}(1) \le \pi^{(n)}(1).$

Since $c_1(\alpha)$ and $c_2(\sigma)$ do not depend on $k$, and by the induction hypotheses

$|(U^{k}\pi)^{(n-1)}(x)| \le M$

uniformly in $k$ and $x$, we find in general that when $(U^{k}\pi)^{(n)}(1)$ becomes larger than $M(\alpha, \sigma, c)$, then

$(U^{k+1}\pi)^{(n)}(1) \le (U^{k}\pi)^{(n)}(1).$

Consequently, the iterates $(U^{k}\pi)^{(n)}(1)$ for $k \ge k_0$ are bounded by

$M(\alpha, \sigma, c) + c_1(\alpha)M + c_2(\sigma)M.$

This trivially implies the conclusion of Theorem 5.

The  proof  of  the  next  theorem  is  due  originally  to  R.  Bellman.  We  present 
it  for  completeness. 

Theorem 6. There exists at most one continuous solution of $U\pi = \pi$ for which $\pi(0) = 0$ and $\pi(1) = 1$.

Proof. (By contradiction.) Let $\pi_1$ and $\pi_2$ denote two solutions with the prescribed boundary conditions. Put $\pi_0 = \pi_1 - \pi_2$; then $\pi_0(0) = \pi_0(1) = 0$. Let $t_0$ be a point where $\pi_0$ achieves its maximum. Since

$$\pi_0(t_0) = (1 - t_0)\pi_0(\sigma t_0) + t_0\pi_0(\alpha + (1-\alpha)t_0),$$

we deduce that $\sigma t_0$ is also a maximum point. Iterating, we find by continuity that $\pi_0(0) = 0$ is the maximum value of $\pi_0(t)$. A similar argument shows that $0 = \min \pi_0(t)$, which implies that $\pi_1 = \pi_2$.

Theorem 7. For any function $\pi(t) = t^r$ with $\infty > r \ge 1$, $U^n(t^r)$ converges uniformly as $n \to \infty$.

Proof. Clearly $t \ge t^r \ge p(t)$, where

$$p(t) = \begin{cases} 0, & \text{for } 0 \le t \le t_0; \\ \dfrac{t - t_0}{1 - t_0}, & \text{for } t_0 \le t \le 1; \end{cases}$$

and $t_0$ is close to 1 with $r$ fixed. Since $Ut$ is convex by Theorem 4, and the values at 0 and 1 are fixed, we find that $t \ge Ut$. Hence

$$U^n t \ge U^{n+1} t \ge 0,$$

and $\lim U^n t = \theta(t)$ for every $t$. Since $\theta(t)$ is convex, and by Theorem 5 the derivatives of $U^n t$ at 1 are uniformly bounded, we conclude that $\theta(t)$ is continuous. By Dini's theorem the convergence of $U^n t$ to $\theta(t)$ is uniform. Obviously, $U\theta = \theta$. On the other hand, if $t_0$ is close to 1 then $(Up)'(1) < p'(1)$ (see the proof of Theorem 5). Since Theorem 4 guarantees the convexity of $Up$, and the slope at 0 is 0, it follows that $Up \ge p$, and hence $U^n p \le U^{n+1} p$; therefore $\lim U^n p = \phi(t)$. Again, $\phi(t)$ is a continuous fixed point, and therefore by Theorem 6 we infer that $\phi(t) = \theta(t)$. On account of $U^n t \ge U^n t^r \ge U^n p$, we deduce that $\lim U^n t^r = \phi(t)$ with the convergence being uniform.

We denote this unique fixed point of $U$ by $\phi_{\sigma,\alpha}(t)$, or by $\phi(t)$ whenever no ambiguity arises.
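The monotone convergence used in the proof of Theorem 7 can be observed numerically. The following is a minimal sketch rather than part of the original argument. It assumes the operator in the form $U\pi(t) = (1-t)\pi(\sigma t) + t\pi(\alpha + (1-\alpha)t)$ appearing in the proof of Theorem 6; the helper names `apply_U` and `iterate_U` and the parameter values $\sigma = 0.5$, $\alpha = 0.3$ (so that $\sigma + \alpha < 1$) are chosen here purely for illustration.

```python
def apply_U(pi, sigma, alpha):
    """Return the function U(pi) for the first model (assumed form of U)."""
    def U_pi(t):
        return (1.0 - t) * pi(sigma * t) + t * pi(alpha + (1.0 - alpha) * t)
    return U_pi


def iterate_U(pi, n, sigma, alpha):
    """Return U^n(pi) by nesting apply_U n times (cost grows like 2**n)."""
    for _ in range(n):
        pi = apply_U(pi, sigma, alpha)
    return pi


SIGMA, ALPHA = 0.5, 0.3      # illustrative values with sigma + alpha < 1


def identity(t):
    return t


# Values of (U^n t) at the fixed point t = 0.6 for n = 0, 1, ..., 10.
values = [iterate_U(identity, n, SIGMA, ALPHA)(0.6) for n in range(11)]
for n, v in enumerate(values):
    print(n, round(v, 6))
```

With these parameters the printed sequence is non-increasing in $n$, as the inequality $U^n t \ge U^{n+1} t \ge 0$ predicts, while the boundary values at 0 and 1 stay fixed.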

Theorem 8. The iterates $U^n$ converge strongly (that is, $U^n\pi$ converges uniformly for any continuous function $\pi$).

Proof. The constant functions are fixed points of $U$. Consequently by Theorem 7, $U^n q$ converges uniformly for any function $q(t)$ in the linear space $L$ spanned by the functions $(1, t^r)$. The set $L$ is dense in the space of continuous functions. Moreover, as $\|U^n\| = 1$, by a well-known theorem of Banach, $U^n q$ converges strongly when applied to any continuous function $q(t)$.

The actual limit is easily seen to be given by

$$\lim_{n\to\infty} U^n q(t) = q(1)\phi_{\sigma,\alpha}(t) + q(0)[1 - \phi_{\sigma,\alpha}(t)]. \tag{6}$$

This is an immediate consequence of the fact that the fixed points of $U$ consist of the two-dimensional space spanned by the functions 1 and $\phi_{\sigma,\alpha}$. Equation (6) shows that two functions $q_1$ and $q_2$ which agree at 0 and 1 have the same limit. This enables us to show:

Theorem 9. If $q(t)$ is any bounded function continuous at 0 and 1, then $U^n q$ converges strongly.

Proof. Let $q(t)$, in addition to being continuous at 0 and 1, possess finite derivatives at 0 and 1. Then clearly there exist two continuous functions $h_1(t)$ and $h_2(t)$ with

$$h_1(t) \ge q(t) \ge h_2(t),$$

where $h_1(0) = h_2(0)$ and $h_1(1) = h_2(1)$. We conclude the result from this using the argument of Theorem 7 and equation (6). If now $q(t)$ is only continuous at 0 and 1, then we can find for any $\epsilon$ a $q_\epsilon(t)$ satisfying the properties assumed about $q(t)$ in the first part of the proof with $|q(t) - q_\epsilon(t)| \le \epsilon$. As $\|U^n\| = 1$, the conclusion of the theorem now follows by a standard argument.

Theorem 10. If $|\pi^{(i)}(t)| \le C_i$ for $0 \le i \le m$, then $|(U^k\pi)^{(i)}(t)| \le C_i'$ for $0 \le i \le m$, with constants $C_i'$ independent of $k$.

Proof. The proof is by induction. For $r = 0$, the result is trivial since $U$ preserves positivity, and the constant functions are fixed points of $U$. Suppose we have established the result for $r = m - 1$. We note that

$$(U\pi)^{(m)} = (1-t)\sigma^m\pi^{(m)}(\sigma t) + t(1-\alpha)^m\pi^{(m)}[\alpha + (1-\alpha)t] + m(1-\alpha)^{m-1}\pi^{(m-1)}[\alpha + (1-\alpha)t] - m\sigma^{m-1}\pi^{(m-1)}(\sigma t).$$

This easily yields that

$$\max_t |(U\pi)^{(m)}(t)| \le \lambda \max_t |\pi^{(m)}(t)| + C \max_t |\pi^{(m-1)}(t)|,$$

where

$$\lambda = \max_t\,[(1-t)\sigma^m + t(1-\alpha)^m] < 1.$$

Therefore,

$$\max_t |(U^k\pi)^{(m)}(t)| \le \lambda \max_t |(U^{k-1}\pi)^{(m)}(t)| + C \max_t |(U^{k-1}\pi)^{(m-1)}(t)| \le \lambda \max_t |(U^{k-1}\pi)^{(m)}(t)| + K$$

by our induction hypothesis. Iterating this last inequality yields

$$\max_t |(U^k\pi)^{(m)}(t)| \le \sum_{i=0}^{k-1} \lambda^i K + \lambda^k \max_t |\pi^{(m)}(t)| \le M.$$

This establishes the theorem.

Theorem 11. If $q(t)$ belongs to $C^n$ ($n$ continuous derivatives), then

$$\lim_{m\to\infty} [U^m q(t)]^{(r)}$$

converges uniformly for $0 \le r \le n - 1$.

Proof. We prove the theorem only for $r = 1$, for the other cases are similar. On account of Theorem 10, the uniform boundedness of $(U^m q)^{(2)}$ implies the equicontinuity of $(U^m q)^{(1)}$. Thus we can select a uniformly convergent subsequence, since the $(U^m q)^{(1)}$ are also uniformly bounded. Let

$$\Psi(t) = \lim_{i\to\infty} (U^{m_i} q)^{(1)}.$$

Since $\lim U^m q$ converges uniformly to a unique limit $\theta(t)$, we obtain $\theta'(t) = \Psi(t)$. As $\theta'(t)$ is independent of the subsequence chosen, the conclusion of the theorem easily follows.

Theorem 12. The fixed point $\phi_{\sigma,\alpha}$ is analytic for $0 \le t < 1$ with $\phi_{\sigma,\alpha}^{(n)} \ge 0$.

Proof. Let $p(t)$ denote an infinitely differentiable function with $p^{(n)}(t) \ge 0$ and $p(0) = 0$, $p(1) = 1$. By virtue of Theorem 11 and Theorem 4 we deduce that

$$\lim (U^n p)^{(r)} = \phi_{\sigma,\alpha}^{(r)} \ge 0.$$

Therefore $\phi_{\sigma,\alpha}$ is absolutely monotonic and hence, by a well-known theorem, is analytic.

At this point it seems desirable to summarize the analogous results of Theorems 2 through 12 for the case where $\alpha + \sigma > 1$, complementary to the assumption $1 - \alpha > \sigma$ used above. We enumerate the corresponding theorems.

Theorem 4'. If $(-1)^{i-1}\pi^{(i)}(t) \ge 0$ for $i = 1, 2, \ldots, n$, and $\pi(t) \ge 0$, then $(-1)^{i-1}(U\pi)^{(i)}(t) \ge 0$.

In particular, positive increasing concave functions are transformed into functions of the same kind.

Theorem 5'. If $C \ge \pi(t) \ge 0$ and $C \ge (-1)^{i-1}\pi^{(i)}(t) \ge 0$ for $1 \le i \le n$, then $0 \le (-1)^{i-1}(U^m\pi)^{(i)}(0) \le K_i$, and hence $|(U^m\pi)^{(i)}(t)| \le K_i$ for $1 \le i \le n$.

Theorem 6 remains unchanged and is valid independent of the conditions on $\sigma$ and $\alpha$, provided only they lie in the open unit interval.

Theorem 7 holds with a modification of the proof where $p(t)$ is replaced by the concave function

$$p(t) = \begin{cases} 1, & \text{for } 1 \ge t \ge t_0; \\ \dfrac{1}{t_0}\,t, & \text{for } 0 \le t \le t_0; \end{cases}$$

and the functions $t^r$ are replaced by $1 - (1-t)^r$. These also constitute, with the constant function, a family of functions whose linear span is dense in $C[0, 1]$. This enables us to infer the validity of Theorem 8. Theorems 9, 10, and 11, with suitable changes in their statements which we leave for the reader, are established by simple appropriate modifications similar to that indicated above for Theorem 7. The unique solution $\phi_{\sigma,\alpha}$ for this situation, where $\alpha + \sigma > 1$, is completely monotonic and hence analytic. In the remainder of this section the theorems are established without any specification as to the value of $\alpha + \sigma$.

Theorem 13. The functions

$$\phi_m(t) = \sum_{n=m}^{\infty} U^n[t(1-t)]$$

converge geometrically to 0.

Proof. It is immediate from (6) that

$$U^n[t(1-t)] = \Psi_n(t)$$

tends uniformly to zero. Since the derivative at 0 and 1 of $t(1-t)$ is 1 and $-1$, we conclude by Theorem 11 that there exists an $n_0(\lambda)$ such that

$$U^{n_0}[t(1-t)] \le \lambda t(1-t)$$

with $\lambda < 1$. Let $kn_0$ denote the last integer multiple of $n_0$ for which $kn_0 \le m$. We obtain

$$0 \le \phi_m(t) \le \phi_{kn_0}(t) \le \frac{\lambda^k}{1-\lambda} \sum_{i=0}^{n_0-1} U^i[t(1-t)] \le C\lambda^k \le C\rho^{(n_0+1)k} \le C\rho^m,$$

where

$$\rho = \lambda^{1/(n_0+1)} < 1.$$

Theorem 14. If $q(t)$ is continuous, $|q'(1)| < \infty$ and $|q'(0)| < \infty$, then $\lim U^n[q(t)]$ converges geometrically.

Proof. We first establish the result for the special functions $t^r$ with $1 \le r < \infty$. A simple calculation shows that

$$-Ct(1-t) \le U(t^r) - t^r \le Ct(1-t).$$

For $n > m$, we obtain upon continued application of $U$ and summation that

$$-C \sum_{i=m}^{n} U^i[t(1-t)] \le U^n(t^r) - U^m(t^r) \le C \sum_{i=m}^{n} U^i[t(1-t)].$$

The conclusion now follows from Theorem 13. The general function $q(t)$, satisfying the hypothesis of Theorem 14, can be bounded from above and below by two polynomials $P_1(t)$ and $P_2(t)$ which agree at 0 and 1. The result now follows directly from this fact and the first part of this proof.

We observe easily the identity

$$Ut - t = (\sigma + \alpha - 1)t(1-t).$$

Applying $U$ successively and adding, we obtain

$$\phi_{\sigma,\alpha} = \lim_{n\to\infty} U^n t = t + (\sigma + \alpha - 1) \sum_{n=0}^{\infty} U^n[t(1-t)]. \tag{7}$$

This is useful for purposes of calculation.
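Since the text recommends (7) for calculation, here is a small sketch of how its partial sums might be evaluated. It relies only on the telescoping identity $U^N t = t + \sum_{n=0}^{N-1} U^n(Ut - t)$ together with $Ut - t = (\sigma + \alpha - 1)t(1-t)$. The function names and the parameter values are illustrative assumptions, and the nested evaluation is deliberately naive (its cost doubles with each iterate).

```python
def apply_U(pi, sigma, alpha):
    """U(pi)(t) = (1 - t) pi(sigma t) + t pi(alpha + (1 - alpha) t)."""
    def U_pi(t):
        return (1.0 - t) * pi(sigma * t) + t * pi(alpha + (1.0 - alpha) * t)
    return U_pi


def partial_sum_phi(t, N, sigma, alpha):
    """t + (sigma + alpha - 1) * sum_{n=0}^{N-1} (U^n g)(t), with g(s) = s(1 - s)."""
    g = lambda s: s * (1.0 - s)
    total = 0.0
    for _ in range(N):
        total += g(t)                   # add the current term (U^n g)(t)
        g = apply_U(g, sigma, alpha)    # advance to U^(n+1) g
    return t + (sigma + alpha - 1.0) * total


def U_power_of_identity(t, N, sigma, alpha):
    """Direct evaluation of (U^N t)(t) for comparison."""
    f = lambda s: s
    for _ in range(N):
        f = apply_U(f, sigma, alpha)
    return f(t)


if __name__ == "__main__":
    for t in (0.1, 0.5, 0.9):
        print(t, partial_sum_phi(t, 12, 0.5, 0.3), U_power_of_identity(t, 12, 0.5, 0.3))
```

The two columns agree up to rounding, so truncating the series in (7) after $N$ terms is the same as stopping the iteration at $U^N t$.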

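Some remarks describing the dependence of $\phi_{\sigma,\alpha}$ on $\sigma$ and $\alpha$ are in order. We consider the following identity:

$$U^n_{\sigma,\alpha} - U^n_{\sigma',\alpha'} = \sum_{i=0}^{n-1} U^i_{\sigma,\alpha}\,(U_{\sigma,\alpha} - U_{\sigma',\alpha'})\,U^{n-i-1}_{\sigma',\alpha'}. \tag{8}$$

If $f(t)$ is any function with bounded derivatives, then we obtain by the mean-value theorem that

$$|(U_{\sigma,\alpha} - U_{\sigma',\alpha'})f| \le \bigl|(1-t)[f(\sigma t) - f(\sigma' t)] + t[f(\alpha + (1-\alpha)t) - f(\alpha' + (1-\alpha')t)]\bigr| \le C(|\sigma - \sigma'| + |\alpha - \alpha'|)\,t(1-t).$$

Applying equation (8) to $f(t) = \phi_{\sigma',\alpha'}$, and remembering that inequalities are preserved by Theorem 2, we obtain

$$|U^n_{\sigma,\alpha}\phi_{\sigma',\alpha'} - \phi_{\sigma',\alpha'}| \le C(|\sigma - \sigma'| + |\alpha - \alpha'|) \sum_{i=0}^{n-1} U^i_{\sigma,\alpha}(t(1-t)).$$

Allowing $n$ to go to $\infty$, we have easily that

$$|\phi_{\sigma,\alpha} - \phi_{\sigma',\alpha'}| \le K(\eta)(|\sigma - \sigma'| + |\alpha - \alpha'|),$$

where $K(\eta)$ is finite, provided that $0 < \eta \le \sigma, \sigma', \alpha, \alpha' \le 1 - \eta < 1$.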

It is worthwhile to discuss the nature of $\phi_{\sigma,\alpha}$ for $(\sigma, \alpha)$ lying on the boundary of the unit square. First, we observe by direct verification that when $\sigma + \alpha = 1$, then $\phi_{\sigma,\alpha}(x) = x$. Next let $\alpha = 0$ and $\sigma < 1$; then

$$U\phi = (1 - x)\phi(\sigma x) + x\phi(x).$$

Therefore, if $\phi$ is a fixed point with $\phi(0) = 0$ and $\phi(1) = 1$, then for $x \ne 1$ we have that $\phi(x) = \phi(\sigma x)$, and hence $\phi(x) = \phi(0) = 0$ $(0 \le x < 1)$ provided that $\phi$ is continuous at 0. Similarly, when $\sigma = 1$ and $\alpha > 0$ the only fixed point $\phi$ continuous at 1 and satisfying $\phi(0) = 0$, $\phi(1) = 1$, is $\phi(x) = 1$ for $0 < x \le 1$. On the other two boundaries of the unit square the solutions are easily calculated and turn out as follows: If $0 < \sigma < 1$ is arbitrary and $\alpha = 1$, then

$$\phi_{\sigma,1} = 1 - \prod_{r=0}^{\infty} (1 - \sigma^r x),$$

while when $\sigma = 0$, $0 < \alpha < 1$, then

$$\phi_{0,\alpha} = \prod_{r=0}^{\infty} L^r x,$$

where $L^0 = I$ and the operation $L$ applied to $x$ gives $\alpha + (1 - \alpha)x$. Finally, for $\alpha = 0$, $\sigma = 1$ the operator $U$ reduces to the identity mapping.

We now investigate the dependence of $\phi_{\sigma,\alpha}$ on $\sigma$ and $\alpha$ as we allow $\sigma$ and $\alpha$ to tend to the boundary. We limit our attention for definiteness to the case where $(\sigma, \alpha) \to (\sigma_0, 0)$ with $\sigma_0 < 1$, and we show that $\phi_{\sigma,\alpha}$ converges pointwise to 0 for $0 \le x < 1$, while $\phi_{\sigma,\alpha}(1) = 1$. Moreover, the convergence is uniform in any interval $0 \le x \le 1 - \delta < 1$. Let $(\sigma_n, \alpha_n) \to (\sigma_0, 0)$; then without loss of generality we may assume that $1 - \sigma_n - \alpha_n > 0$. Therefore the $\phi_{\sigma_n,\alpha_n}$ are convex, monotonic increasing and positive, with $\phi_{\sigma_n,\alpha_n}(0) = 0$. Also, for any interior interval $0 \le x \le 1 - \delta < 1$, the first derivatives $\phi'_{\sigma_n,\alpha_n}$ are uniformly bounded. Since this implies the $\phi_{\sigma_n,\alpha_n}$ are equicontinuous over the subinterval, and as $0 \le \phi_{\sigma_n,\alpha_n} \le 1$, we can select a subsequence, which may be denoted as $\phi_{\sigma_r,\alpha_r}$, converging to $\Psi(t)$ uniformly on any interval of the form $0 \le x \le 1 - \delta < 1$. As $\phi_{\sigma_r,\alpha_r}(1) = 1$, we get $\Psi(1) = 1$ and similarly $\Psi(0) = 0$. The uniform convergence of $\phi_{\sigma_r,\alpha_r}$ guarantees the continuity of $\Psi$ at zero.

Put

$$U_r = U_{\sigma_r,\alpha_r}, \qquad U_0 = U_{\sigma_0,0}, \qquad \text{and} \qquad \phi_r = \phi_{\sigma_r,\alpha_r}.$$

We consider the following identity:

$$\Psi - U_0\Psi = (\Psi - \phi_r) + (\phi_r - U_r\Psi) + (U_r\Psi - U_0\Psi) = I_1 + I_2 + I_3.$$

We take a fixed $x < 1$; then trivially $|I_1| = |\Psi - \phi_r| \le \epsilon$ when $r$ is sufficiently large. Also

$$|I_2| = |\phi_r - U_r\Psi| = |U_r\phi_r - U_r\Psi| = \bigl|(1 - x)[\phi_r(\sigma_r x) - \Psi(\sigma_r x)] + x[\phi_r(\alpha_r + (1 - \alpha_r)x) - \Psi(\alpha_r + (1 - \alpha_r)x)]\bigr|.$$

But for $x = x_0 < 1$ fixed, we observe that $\alpha_r + (1 - \alpha_r)x_0$ varies in an interval $\le 1 - \delta$ as $\alpha_r \to 0$, and the same applies to $\sigma_r x$. The uniform convergence of $\phi_r \to \Psi$ inside $0 \le x \le 1 - \delta$ yields $|I_2| \le \epsilon$. By construction, $|I_3| \le \epsilon$ for $r$ large. Thus we infer the equality $\Psi = U_0\Psi$ for $0 \le x < 1$, and by direct verification for $x = 1$. However, the fixed point of the equation $U_0\Psi = \Psi$ with $\Psi(0) = 0$, $\Psi(1) = 1$ and $\Psi$ continuous at 0 is $\Psi(x) = 0$ for $0 \le x < 1$ and $\Psi(1) = 1$, as shown above for the case $\alpha = 0$, $\sigma < 1$. Thus the limit function $\Psi$ is the same for every subsequence of $\phi_{\sigma_n,\alpha_n}$, and hence we deduce that $\phi_{\sigma_n,\alpha_n}$ converges pointwise. We furthermore note that $\Psi$ is independent of $\sigma_0 < 1$. A similar analysis applies to the case where $(\sigma, \alpha) \to (1, \alpha_0)$ $(\alpha_0 > 0)$. The continuity properties of the solution for the other two boundaries yield to simpler analysis. Summarizing, we have established the following theorem:

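Theorem 15. The fixed points $\phi_{\sigma,\alpha}$ satisfy the following continuity properties: If $0 < \eta \le \sigma, \sigma' < 1$ and $0 < \alpha, \alpha' \le 1 - \eta$, then

$$|\phi_{\sigma,\alpha} - \phi_{\sigma',\alpha'}| \le K(\eta)[|\sigma - \sigma'| + |\alpha - \alpha'|].$$

If $(\sigma, \alpha) \to (\sigma_0, 0)$ with $\sigma_0 < 1$, then $\phi_{\sigma,\alpha}(x) \to 0$ pointwise for $0 \le x < 1$ and $\phi_{\sigma,\alpha}(1) = 1$. If $(\sigma, \alpha) \to (1, \alpha_0)$ with $\alpha_0 > 0$, then $\phi_{\sigma,\alpha}(x) \to 1$ pointwise for $0 < x \le 1$.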
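The two infinite-product solutions quoted for the boundaries $\alpha = 1$ and $\sigma = 0$ can be checked against the fixed-point equation $U\phi = \phi$. The sketch below is an illustration only: the products are truncated after a fixed number of factors, and the function names, the truncation length and the sample parameters are assumptions made here, not taken from the text.

```python
def phi_alpha_one(x, sigma, terms=200):
    """Truncation of 1 - prod_{r>=0} (1 - sigma**r * x); boundary case alpha = 1."""
    prod = 1.0
    for r in range(terms):
        prod *= 1.0 - sigma ** r * x
    return 1.0 - prod


def phi_sigma_zero(x, alpha, terms=200):
    """Truncation of prod_{r>=0} L^r x with L x = alpha + (1 - alpha) x; case sigma = 0."""
    prod, y = 1.0, x
    for _ in range(terms):
        prod *= y                        # multiply by the current factor L^r x
        y = alpha + (1.0 - alpha) * y    # apply L once more
    return prod


def residual_alpha_one(x, sigma):
    """(U phi)(x) - phi(x) when alpha = 1: U phi(x) = (1 - x) phi(sigma x) + x phi(1)."""
    u_phi = (1.0 - x) * phi_alpha_one(sigma * x, sigma) + x * phi_alpha_one(1.0, sigma)
    return u_phi - phi_alpha_one(x, sigma)


def residual_sigma_zero(x, alpha):
    """(U phi)(x) - phi(x) when sigma = 0: U phi(x) = (1 - x) phi(0) + x phi(L x)."""
    lx = alpha + (1.0 - alpha) * x
    u_phi = (1.0 - x) * phi_sigma_zero(0.0, alpha) + x * phi_sigma_zero(lx, alpha)
    return u_phi - phi_sigma_zero(x, alpha)


if __name__ == "__main__":
    for x in (0.0, 0.2, 0.5, 0.9, 1.0):
        print(x, residual_alpha_one(x, 0.5), residual_sigma_zero(x, 0.4))
```

Both residuals vanish up to rounding, because shifting the product by one factor reproduces exactly the relation imposed by $U\phi = \phi$ on each boundary.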

Finally, a word concerning convergence of $U^n\pi$ for $\pi$ continuous when the parameter values lie on the boundary. When $\alpha = 0$, $\sigma < 1$, then $U^n\pi$ converges pointwise. The same conclusion holds when $\alpha > 0$ and $\sigma = 1$. On the other two boundaries the convergence is uniform for $U^n\pi$. We omit the proofs.

We now return to the study of the operator $T$.

Theorem 16. For any distribution $F$ the iterates $T^nF$ converge in the sense of distributions to the distribution

$$G(x) = I_1(x) \int_0^1 \phi_{\sigma,\alpha}\,dF + I_0(x) \int_0^1 (1 - \phi_{\sigma,\alpha})\,dF,$$

where $I_0(x)$ and $I_1(x)$ are the distributions concentrating fully at 0 and 1 respectively.

Proof. From the convergence of $U^n\pi$ for any continuous function $\pi$ and Theorem 1 follows the weak* convergence of $T^nF$. This is equivalent to the convergence of $T^nF$ in the sense of distributions. The actual form of

$$\lim_{n\to\infty} T^nF = G$$

as given in the theorem follows directly from (6).

By choosing the distribution $F = I_{x_0}$, we obtain from Theorem 16 that $\phi_{\sigma,\alpha}(x_0)$ represents the probability with which the limiting distribution concentrates at 1, or in other words, as can be easily shown, the probability with which the particle beginning at $x_0$ will converge to 1. This furnishes a probability interpretation of the fixed point of the operator $U$ which is different from a constant.

In connection with Theorem 8, we remark that $U^n\pi$ cannot converge for an arbitrary Lebesgue measurable bounded function. In fact, if we assume that $U^n\pi$ converges for every bounded measurable function $\pi(t)$, then $T^nF$ would converge weakly if $F$ were absolutely continuous. Since the space of all integrable functions $L[0, 1]$ is weakly complete, and $T$ maps distributions into distributions, we could find a fixed point $TF = F$ with $F$ absolutely continuous and total variation 1. However, in view of Theorem 16 the only fixed distributions which exist concentrate only at 0 and 1, and hence cannot be absolutely continuous.

Finally, we present a slight application of Theorem 14. We show that the expected position of the particle converges geometrically for any starting distribution, although the iterated distributions converge slowly to the limiting distribution. The expected position of the particle is given by

$$\int_0^1 x\,dF(x) = (x, F),$$

where $F$ is the cumulative distribution describing the position. The expected position at the $n$th step is given by

$$(x, T^nF) = (U^n x, F).$$

On account of Theorem 14, $U^n x$ converges geometrically, which establishes the assertion. The same conclusion applies to all the moments. This observation is very useful for computational and estimation purposes.
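The relation $(x, T^nF) = (U^n x, F)$ can be illustrated by simulation when $F$ is concentrated at a single starting point $x_0$. The sketch below assumes the transition rule read off from the form of $U$ in this section: from position $x$ the particle moves to $\alpha + (1-\alpha)x$ with probability $x$ and to $\sigma x$ with probability $1 - x$. The function names, the seed and the sample sizes are illustrative choices.

```python
import random


def apply_U(pi, sigma, alpha):
    """U(pi)(t) = (1 - t) pi(sigma t) + t pi(alpha + (1 - alpha) t)."""
    def U_pi(t):
        return (1.0 - t) * pi(sigma * t) + t * pi(alpha + (1.0 - alpha) * t)
    return U_pi


def expected_position(x0, n, sigma, alpha):
    """Evaluate (U^n x)(x0), the expected position after n steps from x0."""
    f = lambda s: s
    for _ in range(n):
        f = apply_U(f, sigma, alpha)
    return f(x0)


def simulate_mean_position(x0, n, sigma, alpha, trials, rng):
    """Monte Carlo estimate of the mean position after n steps (assumed rule)."""
    total = 0.0
    for _trial in range(trials):
        x = x0
        for _step in range(n):
            if rng.random() < x:               # probability x: move toward 1
                x = alpha + (1.0 - alpha) * x
            else:                              # probability 1 - x: move toward 0
                x = sigma * x
        total += x
    return total / trials


rng = random.Random(0)
estimate = simulate_mean_position(0.4, 8, 0.5, 0.3, 20000, rng)
exact = expected_position(0.4, 8, 0.5, 0.3)
print(estimate, exact)
```

The simulated mean agrees with $(U^n x)(x_0)$ up to sampling error, which is the practical content of the remark on computation: moments can be obtained by iterating $U$ on a polynomial instead of tracking the whole distribution.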

Finally, we note that the spectrum of the operator $T$ cannot consist of the isolated point 1. Otherwise, by standard techniques one can show that $U^n\pi$ converges for any measurable bounded function $\pi$.

2. In this second model the random walk is described as follows: If the particle is at $x$, then $x \to \alpha + (1 - \alpha)x$ with probability $\phi(x)$ and $x \to \sigma x$ with probability $1 - \phi(x)$, where

$$|\phi(x) - \phi(y)| \le \mu < 1.$$

The analogous transition operator to (1) becomes

$$TF(x) = \int_0^{x/\sigma} (1 - \phi(t))\,dF(t) + \int_0^{(x-\alpha)/(1-\alpha)} \phi(t)\,dF(t), \tag{9}$$

with the same understanding concerning $F$ applying as before. Let

$$U\pi = [1 - \phi(t)]\pi(\sigma t) + \phi(t)\pi[\alpha + (1 - \alpha)t]. \tag{10}$$

In this section, we take $0 < \alpha, \sigma < 1$; the case where boundary values for $\sigma$ and $\alpha$ are considered is easy to handle but not of great interest. The spaces on which the operators act are the same as in § 1. Again, in a similar manner to Theorem 1, we obtain:

Theorem 17. The operator $T$ is conjugate to the operator $U$.

We now further assume that $\phi(t)$ is monotonic increasing. This model includes the important case where $\phi(t) = \lambda + \mu t$, where $\lambda + \mu \le 1$; and whenever $\lambda + \mu = 1$ then $\lambda > 0$.

Theorem 18. The operator $U$ preserves positivity and positive monotonic increasing functions.

Proof. Direct verification.

Since the hypothesis on $\phi(t)$ implies either $\phi(1) < 1$ or $\phi(0) > 0$, we analyze the case where $\phi(1) < 1$. The other circumstance can be treated in an analogous manner. Furthermore, we now assume that if $\phi(0) = 0$, then $\phi'(0)$ exists and is finite.

Theorem 19. If $\pi(t)$ is monotonic increasing, bounded and positive, then $U^n\pi$ converges uniformly to a constant.

The  proof  can  be  carried  out  easily  using  the  techniques  employed  above. 
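A small numerical experiment makes the statement of Theorem 19 concrete. The sketch below uses the operator (10) with the linear choice $\phi(t) = \lambda + \mu t$ mentioned above. For an increasing $\pi$ the spread $U^n\pi(1) - U^n\pi(0)$ bounds the oscillation of $U^n\pi$ (Theorem 18 keeps the iterates increasing), so it should shrink toward zero. All names and parameter values here are illustrative assumptions.

```python
def apply_U2(pi, move_prob, sigma, alpha):
    """Operator (10): U pi(t) = [1 - phi(t)] pi(sigma t) + phi(t) pi(alpha + (1 - alpha) t)."""
    def U_pi(t):
        p = move_prob(t)
        return (1.0 - p) * pi(sigma * t) + p * pi(alpha + (1.0 - alpha) * t)
    return U_pi


LAM, MU = 0.2, 0.5               # phi(t) = LAM + MU * t, so phi(1) = 0.7 < 1
SIGMA, ALPHA = 0.5, 0.4          # illustrative interior parameter values


def move_prob(t):
    return LAM + MU * t


f = lambda t: t                  # an increasing, bounded, positive test function
spreads = []
for n in range(13):
    spreads.append(f(1.0) - f(0.0))            # oscillation of U^n pi on [0, 1]
    f = apply_U2(f, move_prob, SIGMA, ALPHA)

for n, s in enumerate(spreads):
    print(n, round(s, 6))
```

With these values the spread starts at 1 and decreases at every step, consistent with uniform convergence of $U^n\pi$ to a constant.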

The hypothesis on $\phi(t)$ easily yields the fact that the only continuous fixed points of $U\pi = \pi$ are constant functions. The proof is similar to the proof used in Theorem 6. This fact directly connects with the result of Theorem 21 below. First, we complete the proof of convergence of $U^n\pi$ for any continuous function $\pi(t)$.

Theorem 20. The iterates $U^n\pi$ converge uniformly for any continuous function $\pi$.

Proof. Since $\|U^n\| = 1$, and the space of all monotonic positive continuous functions spans a dense subset of the set of all continuous functions, the theorem follows by a well-known theorem of Banach.

Theorem 21. For any distribution $F$, the distributions $T^nF$ converge as distributions to a unique distribution $G$ for which $TG = G$ and which is independent of $F$.

Proof. The weak* convergence of $T^nF$ follows directly from Theorem 20 and Theorem 16. To complete the proof we must establish that if $\lim T^nF = G$ and $\lim T^nH = K$, then $G = K$. Indeed, let $\Psi$ denote any continuous function, and let $a$ denote the constant to which $U^n\Psi$ converges. We have that

$$(\Psi, G - K) = \lim_{n\to\infty} (\Psi, T^n(F - H)) = \lim_{n\to\infty} (U^n\Psi, F - H) = a\left(\int dF - \int dH\right) = 0 \tag{11}$$

as $F$ and $H$ are distributions. Hence

$$\int \Psi(t)\,dG(t) = \int \Psi(t)\,dK(t)$$

for any continuous function $\Psi$, and therefore $G = K$.

It seems extremely difficult to determine the complete nature of this unique fixed distribution. We shall say more about it in a later section. We denote it by $F_{\sigma,\alpha}$.

Theorem 22. The distribution $F_{\sigma,\alpha}$ is a continuous function of $\sigma, \alpha$; that is, if $(\sigma_n, \alpha_n) \to (\sigma, \alpha)$ with $0 < \sigma, \alpha < 1$, then $F_{\sigma_n,\alpha_n} \to F_{\sigma,\alpha}$ at every point of continuity of $F_{\sigma,\alpha}$.

Proof. Let $(\sigma_n, \alpha_n) \to (\sigma, \alpha)$; by Helly's theorem we can choose a subsequence $F_r = F_{\sigma_r,\alpha_r}$ converging to a distribution $F$ at every continuity point. Write $T_r$ for $T_{\sigma_r,\alpha_r}$ and $T$ for $T_{\sigma,\alpha}$. Let $\pi(t)$ denote any fixed continuous function. We consider the quantity

$$(\pi, F - TF) = (\pi, F - F_r) + [(\pi, F_r) - (\pi, TF_r)] + (\pi, TF_r - TF).$$

Since $F_r \to F$ as distributions, we find for $r$ sufficiently large that $|(\pi, F - F_r)| \le \epsilon$. Now we note that

$$|(\pi, F_r) - (\pi, TF_r)| = |(\pi, T_rF_r) - (\pi, TF_r)| = |(U_r\pi - U\pi, F_r)|.$$

Since $U_r = U_{\sigma_r,\alpha_r}$ converges strongly to $U = U_{\sigma,\alpha}$, as is trivial to verify, it follows that $U_r\pi$ converges uniformly to $U\pi$. Whence, as the $F_r$ are distributions, we infer that

$$|(U_r\pi - U\pi, F_r)| \le \max_t |U_r\pi - U\pi| \le \epsilon$$

when $r$ is chosen large enough. Evidently, with $r$ large we get as before that

$$|(\pi, T(F_r - F))| = |(U\pi, F_r - F)| \le \epsilon.$$

Therefore we obtain for $r$ large that $|(\pi, F - TF)| \le 3\epsilon$, and hence $(\pi, F) = (\pi, TF)$. Since $\pi$ is any continuous function, we infer $F = TF$ and therefore $F = F_{\sigma,\alpha}$ by Theorem 21. Consequently, as any limit distribution of $F_{\sigma_n,\alpha_n}$ must be $F_{\sigma,\alpha}$, the conclusion of Theorem 22 is now immediate.

3. The model considered in this section is with $\phi(x) = 1 - x$. In this case $\phi$ is monotonic decreasing. The operator $U$ becomes

$$U\pi(t) = t\pi(\sigma t) + (1 - t)\pi(1 - \alpha + \alpha t). \tag{12}$$

Note that we have replaced $\alpha$ by $1 - \alpha$. This is only for convenience in Theorem 28, and does not restrict any generality. In this model the closer the particle moves to the ends 0 and 1 the greater the probability of moving back into the interior. The situation described here is that of completely reflecting boundaries. Again it is easy to show that the only continuous fixed points of $U\pi = \pi$ are the constant functions. Therefore, we shall find as in § 2 that the distributions describing the position of the particle converge to a limit distribution independent of the initial distribution. We first proceed to analyze convergence properties of $U^n\pi$. In this case it is no longer true that $U$ preserves the class of positive monotonic functions. Only positivity is conserved by the mapping $U$. However, a new property, described in Theorem 23, serves well here. Throughout this section, in order to avoid trivial changes of proof and different results at times, we suppose that $0 < \alpha, \sigma < 1$.

Theorem 23. If $\pi(t)$ has a continuous derivative, then

$$\max_t |(U\pi)'(t)| \le \max_t |\pi'(t)|,$$

with equality holding if and only if $\pi(t)$ is linear.

Proof. By direct computation, we obtain

$$(U\pi)'(t) = t\sigma\pi'(\sigma t) + (1 - t)\alpha\pi'(1 - \alpha + \alpha t) + \pi(\sigma t) - \pi(1 - \alpha + \alpha t).$$

Hence, with the aid of the mean-value theorem we get

$$\max_t |(U\pi)'(t)| \le \max_t \left| t\sigma\pi'(\sigma t) + (1 - t)\alpha\pi'(1 - \alpha + \alpha t) + \frac{\pi(\sigma t) - \pi(1 - \alpha + \alpha t)}{\sigma t - (1 - \alpha) - \alpha t}\,\bigl(\sigma t - (1 - \alpha) - \alpha t\bigr) \right| \le \max_t\,[t\sigma + (1 - t)\alpha + 1 - \alpha - (\sigma - \alpha)t]\,\max_t |\pi'(t)| = \max_t |\pi'(t)|. \tag{13}$$

If equality holds, then let $t_0$ denote a point where

$$\max_t |\pi'(t)| = |\pi'(t_0)|.$$

It follows easily from (13) that

$$\max_t |\pi'(t)| = |\pi'(\sigma t_0)| = |\pi'(1 - \alpha + \alpha t_0)| = \left| \frac{\pi(\sigma t_0) - \pi(1 - \alpha + \alpha t_0)}{\sigma t_0 - (1 - \alpha) - \alpha t_0} \right|. \tag{14}$$

This yields that $\pi(t)$ is linear for $\sigma t_0 \le t \le 1 - \alpha + \alpha t_0$, for otherwise somewhere between $\sigma t_0$ and $1 - \alpha + \alpha t_0$ the slope has greater magnitude than the slope of the chord subtended by $\pi(t)$ at these points. Equation (14) shows also that $\sigma t_0$ and $1 - \alpha + \alpha t_0$ are maximum points of $|\pi'(t)|$. Repeating this argument successively then implies that equality in (13) requires $\pi(t)$ to be linear.
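The inequality of Theorem 23 is easy to probe on a grid. The sketch below uses the derivative formula from the proof for operator (12); the test function $\pi(t) = t^3$, the grid and the parameter values are illustrative assumptions, and a grid maximum only approximates the true maximum of $|(U\pi)'|$.

```python
def U3_derivative(pi, dpi, sigma, alpha):
    """(U pi)'(t) for operator (12), using the formula from the proof of Theorem 23."""
    def d(t):
        lo = sigma * t                      # argument sigma * t
        hi = 1.0 - alpha + alpha * t        # argument 1 - alpha + alpha * t
        return t * sigma * dpi(lo) + (1.0 - t) * alpha * dpi(hi) + pi(lo) - pi(hi)
    return d


SIGMA, ALPHA = 0.6, 0.3                     # illustrative values in (0, 1)


def pi(t):
    return t ** 3


def dpi(t):
    return 3.0 * t ** 2


grid = [i / 1000 for i in range(1001)]
dU = U3_derivative(pi, dpi, SIGMA, ALPHA)
max_dpi = max(abs(dpi(t)) for t in grid)    # attained at t = 1
max_dU = max(abs(dU(t)) for t in grid)
print(max_dU, max_dpi)
```

For this nonlinear $\pi$ the maximum slope of $U\pi$ is strictly smaller than that of $\pi$, in line with the theorem.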

Theorem 24. If $\pi(t)$ belongs to $C^m$ [$\pi(t)$ possesses $m$ continuous derivatives], then $\max_t |(U^n\pi)^{(r)}(t)|$ is uniformly bounded in $n$ for each $r$ $(0 \le r \le m)$.

Proof. The proof is similar to that of Theorem 10.

Theorem 25. If $\pi(t)$ possesses two continuous derivatives, and $\sigma \ne \alpha$, then $U^n\pi$ converges uniformly to a constant.

Remark. The reason why the two cases $\sigma = \alpha$ and $\sigma \ne \alpha$ are distinguished, and necessarily so, will be explained later.

Proof. In view of Theorem 23 and Theorem 24, the first and second derivatives of $U^n\pi$ are uniformly bounded. Thus $U^n\pi$ and $(U^n\pi)'$ constitute equicontinuous families of functions. We can thus select a subsequence $n_i$ such that $U^{n_i}\pi$ converges uniformly to $\phi(t)$, and $(U^{n_i}\pi)'$ converges uniformly to $\phi'(t)$. It follows trivially that $U^{n_i+1}\pi$ tends uniformly to $U\phi$ and $(U^{n_i+1}\pi)'$ tends uniformly to $(U\phi)'$, and similarly for $U^{n_i+2}\pi$.

Moreover, by virtue of Theorem 23,

$$\max_t |(U^{n_i}\pi)'| \ge \max_t |(U^{n_i+1}\pi)'| \ge \max_t |(U^{n_{i+1}}\pi)'|. \tag{15}$$

Hence

$$\lim_{i\to\infty} \max_t |(U^{n_i}\pi)'| = \lim_{i\to\infty} \max_t |(U^{n_i+1}\pi)'| = \lim_{i\to\infty} \max_t |(U^{n_i+2}\pi)'|.$$

Therefore, by the uniform convergence of the derivatives, we secure

$$\max_t |\phi'(t)| = \max_t |(U\phi)'(t)| = \max_t |(U^2\phi)'(t)|.$$

Invoking Theorem 23 yields that $\phi(t)$ and $U\phi(t)$ are linear. However, if $\sigma \ne \alpha$ and $\phi(t)$ contains a term with $t$, then $U\phi$ is quadratic. This impossibility forces $\phi(t)$ to be identically a constant $c$. Let $i$ be chosen sufficiently large so that

$$|U^{n_i}\pi - c| \le \epsilon.$$

Then

$$|U^{n_i+1}\pi - c| \le t\,|U^{n_i}\pi(\sigma t) - c| + (1 - t)\,|U^{n_i}\pi(1 - \alpha + \alpha t) - c| \le \epsilon.$$

Repeating this argument shows that

$$|U^{n_i+p}\pi - c| \le \epsilon$$

for any $p$. This establishes that $U^n\pi$ converges uniformly to $c$.

Theorem 26. If $\pi(t)$ is continuous and $\sigma \ne \alpha$, then $U^n\pi$ converges uniformly.

Proof. The space of all functions with two continuous derivatives spans linearly a dense subset of the space of all continuous functions. Since $\|U^n\| = 1$, we obtain the result using Theorem 25 and a well-known theorem of Banach.

In the next two theorems we establish the uniform convergence of $U^n\pi$ for the case where $1 > \sigma = \alpha > 0$. We note in this case the interesting fact that $U$ applied to a polynomial does not increase its degree. Particularly,

$$Ux^n = [\alpha^n - n\alpha^{n-1}(1 - \alpha)]x^n + P_{n-1}(x),$$

where $P_{n-1}(x)$ denotes a polynomial of degree $n - 1$.
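Both the degree statement above and the remark in the proof of Theorem 25 (a linear function is mapped to a quadratic when $\sigma \ne \alpha$) can be verified with exact rational arithmetic. The following sketch represents a polynomial by its coefficient list `[c0, c1, ...]` and applies operator (12); the representation and the helper names are assumptions made for this illustration.

```python
from fractions import Fraction
from math import comb


def compose_affine(p, a, b):
    """Coefficients of p(a + b*t), where p = [c0, c1, ...] lists coefficients of t**k."""
    out = [Fraction(0)] * len(p)
    for k, c in enumerate(p):
        for j in range(k + 1):
            # binomial expansion of c * (a + b t)**k
            out[j] += c * comb(k, j) * a ** (k - j) * b ** j
    return out


def apply_U3_poly(p, sigma, alpha):
    """Coefficients of t*p(sigma t) + (1 - t)*p(1 - alpha + alpha t), operator (12)."""
    first = [Fraction(0)] + compose_affine(p, Fraction(0), sigma)   # t * p(sigma t)
    inner = compose_affine(p, 1 - alpha, alpha)
    second = inner + [Fraction(0)]                                  # (1 - t) * inner
    for j, c in enumerate(inner):
        second[j + 1] -= c
    return [x + y for x, y in zip(first, second)]


if __name__ == "__main__":
    a = Fraction(2, 5)
    # Case sigma = alpha: U(t^3) has no t^4 term.
    print(apply_U3_poly([0, 0, 0, 1], a, a))
    # Case sigma != alpha: U(t) is quadratic with t^2 coefficient sigma - alpha.
    print(apply_U3_poly([0, 1], Fraction(3, 5), a))
```

In general the coefficient of $t^{n+1}$ in $U(t^n)$ is $\sigma^n - \alpha^n$, which vanishes exactly when $\sigma = \alpha$; this is why the two cases need separate treatment.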

Theorem 27. If $P(t)$ is any polynomial, then $U^nP$ converges uniformly to a constant and the convergence is geometric.

Proof. The proof is by induction on the degree of the polynomial. Clearly if $P$ is a constant $c$ then $U^nP = c$. Suppose we have shown for any polynomial $P_{n-1}$ of degree $\le n - 1$ that the iterates $U^kP_{n-1}$ converge uniformly. To complete the proof, it is enough to verify that $U^kx^n$ converges uniformly. Let

$$\lambda = \alpha^n - n\alpha^{n-1}(1 - \alpha);$$

then $|\lambda| < 1$ since $1 > \alpha > 0$. We obtain

$$Ux^n = \lambda x^n + P_{n-1}(x).$$

Repeating, we get, for $k \ge 1$,

$$U^kx^n = \lambda^k x^n + \sum_{r=0}^{k-1} \lambda^r\,U^{k-1-r}P_{n-1}(x).$$

This last sum is of the form

$$c_k(x) = \sum_{r=0}^{k} a_r\,b_{k-r}(x)$$

with $\sum |a_r| < \infty$, and $\lim_{k\to\infty} b_k(x)$ exists. It is a well-known theorem that $\lim c_k(x)$ exists uniformly whenever $b_k(x)$ converges uniformly. Thus, $U^kx^n$ converges uniformly to a fixed point which must be a constant function. Finally we note that in the case where $\sigma = \alpha$ (the rate of learning, so to speak, is the same regardless of the outcome of the experiment), $U^nP$ for any polynomial converges geometrically. The proof can be carried through by using induction.

This yields the fact that the expected position converges geometrically to a limiting expected position, with similar results valid for higher moments.

Theorem 28. If $\pi(t)$ is continuous and $\sigma = \alpha > 0$, then $U^n\pi$ converges uniformly.

Proof. Similar to Theorem 26, since the set of all polynomials is dense.

We now note the important example that when $\alpha = \sigma = 0$ it is no longer true that $U^n\pi$ converges. It is easily verified that in this case $U^{2n}\pi$ and $U^{2n+1}\pi$ converge separately but that a periodic phenomenon occurs otherwise. The argument of Theorem 27 breaks down in this case as the quantity $\lambda$ is $-1$. We only mention that other difficult convergence behavior occurs when $\sigma, \alpha$ traverse the boundary of the unit square for this model. In particular, when $\alpha = 1$ and $\sigma < 1$ it is not hard to show that $U^n\pi$ does not necessarily converge for every continuous function $\pi$, and even for the circumstance where $\pi$ is a polynomial. The case where $\sigma = \alpha = 1$ produces for $U$ the identity operator, for which the convergence of $U^n$ is trivial. For $\alpha < 1$ and $\sigma = 1$ we can conclude again a lack of convergence. However, when $\alpha = 0$ and $1 > \sigma > 0$, or $\sigma = 0$ and $1 > \alpha > 0$, then $U^n\pi$ converges for every continuous function $\pi$.
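The periodic behavior at $\alpha = \sigma = 0$ can be seen directly from (12), which then reduces to $U\pi(t) = t\pi(0) + (1-t)\pi(1)$. The short sketch below applies this to $\pi(t) = t$; the function name is an illustrative choice.

```python
def apply_U3(pi, sigma, alpha):
    """Operator (12): U pi(t) = t pi(sigma t) + (1 - t) pi(1 - alpha + alpha t)."""
    def U_pi(t):
        return t * pi(sigma * t) + (1.0 - t) * pi(1.0 - alpha + alpha * t)
    return U_pi


pi0 = lambda t: t
u1 = apply_U3(pi0, 0.0, 0.0)    # t*pi0(0) + (1 - t)*pi0(1) = 1 - t
u2 = apply_U3(u1, 0.0, 0.0)     # t*u1(0) + (1 - t)*u1(1) = t
u3 = apply_U3(u2, 0.0, 0.0)     # back to 1 - t

for t in (0.0, 0.25, 1.0):
    print(t, u1(t), u2(t), u3(t))
```

The odd iterates equal $1 - t$ and the even iterates equal $t$, so each subsequence converges (trivially) while the full sequence $U^n\pi$ oscillates with period 2.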

We  return  now  to  the  hypothesis  0  <  a,  a  <  1 . 

Theorem 29. If $\pi(t)$ belongs to $C^m$, then $(U^n\pi)^{(r)}(t)$ converges uniformly for
$0 \le r \le m$.

Proof. This follows easily from Theorems 24, 26, and 28. Let

$$TF(x) = \int_0^{x/\sigma} t\,dF(t) + \int_0^{(x+\alpha-1)/\alpha} (1 - t)\,dF(t).$$

This represents the transition law for the distribution describing the position of the
particle for this model. By arguments analogous to those employed in the preceding
sections, we can establish the following theorems, using the conjugate relationship
between $T$ and $U$.

Theorem 30. For any distribution $F$ the distributions $T^nF$ converge as distributions
to a unique distribution $F_{\sigma,\alpha}$ for which $TF_{\sigma,\alpha} = F_{\sigma,\alpha}$, which is independent of $F$.

Theorem 31. The distributions $F_{\sigma,\alpha}$ constitute a continuous family of distributions
in the sense of Theorem 11.

Again it seems very difficult to determine any more explicit information
about $F_{\sigma,\alpha}$.

4. The model examined here is such that $1 - \phi(x) = \lambda x + \mu$, with $\lambda + \mu \le 1$
and at least $1 > \lambda$ or $0 < \mu$. The operator $U$ has the form

$$U\pi = (\lambda x + \mu)\pi(\sigma x) + (1 - \lambda x - \mu)\pi(1 - \alpha + \alpha x). \tag{16}$$

Of course, as before, $0 < \sigma, \alpha < 1$. Convergence questions for $U^n\pi$ turn out to be very
elementary in this case in view of the following theorem, which is easily proven.

Theorem 32. If $\pi(x)$ has a bounded derivative, then

$$\max_x |(U\pi)'(x)| \le a \max_x |\pi'(x)|$$

with $a < 1$.

An immediate consequence of Theorem 32 is that $(U^n\pi)'$ converges geometrically
to 0. Let $T$ denote the transition operator of distributions for this model. In
the standard way, we obtain:

Theorem 33. For any distribution $F$ the distributions $T^nF$ converge to the
distribution $F_{\sigma,\alpha}$, which is a continuous function of $(\sigma, \alpha)$, and $TF_{\sigma,\alpha} = F_{\sigma,\alpha}$. Moreover,
$F_{\sigma,\alpha}$ is independent of $F$.

5. This section is devoted to some variations of the preceding models. A new
feature added first is that we allow, in addition to the two impulses of motion towards
the two fixed points 0 and 1 by the transformations

$$F_1 x = \sigma x \quad\text{and}\quad F_2 x = 1 - \alpha + \alpha x,$$

the possibility of a third motion where the particle stands still with certain probability.
These models are particularly important in learning problems, and much statistical
investigation on this type has been done by M. M. Flood [5]. They are referred to as
the pure models. The mathematical description of the first model of this type is as
follows: A particle $x$ on the unit interval is subject to three random impulses: (1)
$x \to \sigma x$ with probability $\pi_1(1 - x)$; (2) $x \to 1 - \alpha + \alpha x$ with probability $\pi_2 x$; and
(3) $x \to x$ with probability $(1 - \pi_1)(1 - x) + (1 - \pi_2)x$, where $0 < \pi_1, \pi_2 \le 1$. This
is similar to model I where absorption takes place at the boundaries 0 and 1. The
operator analogous to (2) becomes

$$U\pi = \pi_1(1 - x)\pi(\sigma x) + [(1 - \pi_1)(1 - x) + (1 - \pi_2)x]\pi(x) + \pi_2 x\,\pi(1 - \alpha + \alpha x). \tag{17}$$

Again, let $T$ denote the transition operator which maps the distribution locating the
particle into the corresponding distribution at the end of the experiment. Theorem 1
is valid for this setup, and $T$ is consequently conjugate to $U$. It is easy to verify that $U$
fulfills the conditions of Theorems 2 and 3 and also preserves the property of monotone
increasing functions. Furthermore, we obtain:

Theorem 34. If $\pi$, $\pi'$ and $\pi'' \ge 0$, then $(U\pi)'' \ge 0$ if and only if

$$(1 - \sigma)\pi_1 + \pi_2(\alpha - 1) \ge 0,$$

and otherwise $U\pi$ preserves with $\pi$ and $\pi' \ge 0$ the property of concavity.

Proof.    The  proof  can  be  carried  through  by  direct  computation. 

We remark that the remainder of the analogue to Theorem 4 does not carry
over under the condition stated in Theorem 34. Moreover, noting that we have here
changed $\alpha$ into $1 - \alpha$ as compared to § 2, we obtain for $\pi_1 = \pi_2 = 1$ the condition
of § 1 for preservation of convexity, and so on.

The analogues of Theorems 5, 6, 7, and 8 easily extend to this model by the
same methods, and we obtain that $U^n\pi$ converges uniformly to a limit given by

$$[1 - \phi_{\sigma,\alpha,\pi_1,\pi_2}(x)]\,\pi(0) + \phi_{\sigma,\alpha,\pi_1,\pi_2}(x)\,\pi(1), \tag{18}$$

where $\phi_{\sigma,\alpha,\pi_1,\pi_2}$ is the unique continuous fixed point of $U\phi = \phi$ with $\phi(0) = 0$ and
$\phi(1) = 1$. The entire theory of geometric convergence, continuity of $\phi$ as a function
of $\sigma$, $\alpha$, $\pi_1$, and $\pi_2$, and the form of the limiting distribution of the particle established
for the model of § 1 remains valid with slight changes in the proofs. The general
conclusion is that introducing a probability of standing still has no effect on the convergence
of the distributions or its limiting form provided only the essential feature of absorbing
boundaries still prevails. Finally, in this connection we remark that for special
boundary values of the parameters $\pi_1$ and $\pi_2$ the motion may become a drift to one or other
of the end points; for example, $\pi_1 = 0$, $\pi_2 > 0$.

6. We treat in this section the following general nonlinear one-dimensional
learning model. The particle moves with probability $\phi(x)$ from $x$ to $1 - \alpha + \alpha x$
and with probability $1 - \phi(x)$ from $x$ to $\sigma x$. The function $\phi$ is only continuous with the
additional important requirement for this case that $\phi(x) \ge \delta > 0$ and $1 - \phi(x) \ge \delta > 0$
for all $x$ in the unit interval. This excludes the types of models discussed in §§ 1 and 3,
but includes some subcases of the examples investigated in §§ 2 and 4. However, in
those cases we obtained much stronger results about the rate of convergence of
derivatives, and so on. The transition operators become

$$TF(x) = \int_0^{x/\sigma} [1 - \phi(t)]\,dF(t) + \int_0^{(x+\alpha-1)/\alpha} \phi(t)\,dF(t), \tag{20}$$

and $T$ is adjoint to

$$(U\pi)(t) = (1 - \phi(t))\,\pi(\sigma t) + \phi(t)\,\pi(1 - \alpha + \alpha t). \tag{21}$$

We shall show that $U^n\pi$ converges uniformly for any continuous function $\pi(t)$. The
proof of this fact shall be based on the following highly intuitive proposition. Let
an experiment be repeated with only two possible outcomes, success or failure at each
trial. Suppose further that the probability of success $p_n$ at the $n$th trial depends on the
outcome of the previous trial, but that these conditional probabilities satisfy $p_n \ge \eta > 0$;
that is, regardless of the previous number of failures the conditional probability of
success is always at least $\eta > 0$. Then the recurrent event of a success run of length $r$
with $r$ fixed is a certain event; that is, with probability 1 it will occur in finite time. This
result can be deduced in a standard way using the theory of recurrent events [4].
We turn back now to the examination of $U^n\pi$. Let

$$F_1 x = \sigma x \quad\text{and}\quad F_2 x = 1 - \alpha + \alpha x,$$

and by $Fx$ denote the operation that either $F_1$ or $F_2$ is applied. We note the important
obvious fact that

$$|F^r x - F^r y| \le \lambda^r |x - y|, \tag{22}$$

with $0 < \lambda < 1$, where $F^r$ denotes $r$ applications of $F_1$ and $F_2$ in some order acting on
$x$ and $y$ in the same way.
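
The contraction estimate (22) can be checked numerically. The following is a minimal sketch and not part of the original paper: it assumes the contraction constant is the larger of the two slopes, $\lambda = \max(\sigma, \alpha)$ (the text only states $0 < \lambda < 1$), and the function names and parameter values are illustrative.

```python
import itertools

def make_maps(sigma, alpha):
    """Return the two impulses F1 x = sigma*x and F2 x = 1 - alpha + alpha*x."""
    f1 = lambda x: sigma * x
    f2 = lambda x: 1.0 - alpha + alpha * x
    return f1, f2

def apply_word(word, maps, x):
    """Apply the impulses indexed by `word` (a tuple of 0/1) to x, in order."""
    for i in word:
        x = maps[i](x)
    return x

def worst_ratio(sigma, alpha, r, x=0.1, y=0.9):
    """Largest |F^r x - F^r y| / |x - y| over all 2**r orderings of F1 and F2."""
    maps = make_maps(sigma, alpha)
    return max(
        abs(apply_word(w, maps, x) - apply_word(w, maps, y)) / abs(x - y)
        for w in itertools.product((0, 1), repeat=r)
    )

sigma, alpha, r = 0.6, 0.3, 5
lam = max(sigma, alpha)  # assumed choice of the contraction constant
print(worst_ratio(sigma, alpha, r), lam ** r)
```

Because both impulses are affine, each ordering contracts distances by exactly the product of the slopes used, so the worst case is $r$ applications of the map with the larger slope.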

Next, we need the important lemma:

Lemma. If $|\phi^{(m)}(t)| \le K$ for $m = 0, 1, \ldots$, and $|\pi^{(m)}(t)| \le K_1$, then
$|(U^n\pi)^{(m)}(t)| \le K_2$ uniformly in $n$ and $t$.

Proof. The proof is similar to that of Theorem 24.

Now let $\pi(t)$ denote a continuously differentiable function. Consider the
following identity:

$$\begin{aligned}
U^n\pi(x) - U^n\pi(y) ={}& (1 - \phi(x))(1 - \phi(y))\,[U^{n-1}\pi(F_1 x) - U^{n-1}\pi(F_1 y)] \\
&+ \phi(x)\phi(y)\,[U^{n-1}\pi(F_2 x) - U^{n-1}\pi(F_2 y)] \\
&+ (1 - \phi(y))\phi(x)\,[U^{n-1}\pi(F_2 x) - U^{n-1}\pi(F_1 y)] \\
&+ \phi(y)(1 - \phi(x))\,[U^{n-1}\pi(F_1 x) - U^{n-1}\pi(F_2 y)].
\end{aligned} \tag{23}$$


We continue to apply this identity to the factors $U^{n-1}\pi(\cdot) - U^{n-1}\pi(\cdot)$; and when any
term of the form $U^k\pi(F^r w) - U^k\pi(F^r z)$ is achieved, then that factor is allowed to
stand without any further reduction. All other terms are reduced to expressions
involving as factors $\pi(\cdot) - \pi(\cdot)$. Thus we obtain

$$U^n\pi(x) - U^n\pi(y) = I_1 + I_2,$$

where $I_1$ consists of terms of the form

$$p_k\,[U^k\pi(F^r w) - U^k\pi(F^r z)]$$

and $\sum p_k \le 1$ while $I_2$ consists of the remaining terms. We now conceive of the
following probability model. Let two particles undergo the random walk described
by this model starting from $x$ and $y$, respectively. We say a success occurs if the same
impulse activates both particles, and otherwise failure occurs. The probability of
success is given initially by

$$\phi(x)\phi(y) + [1 - \phi(x)][1 - \phi(y)] \ge 2\delta^2 > 0,$$

and it is easily seen that each $p_k$, where $p_k$ is the conditional probability of success
occurring on the $k$th trial, satisfies

$$p_k \ge 2\delta^2 > 0.$$
Consequently, a success run of length $r$ is certain to happen in finite time. In particular
as $n \to \infty$, $I_2 \to 0$, since $I_2$ is bounded by twice the probability of no success run in $n$
trials times $K$. On the other hand, in view of the lemma and equation (22) we secure
that $I_1 \le C\lambda^r$. Therefore,

$$\limsup_{n\to\infty} |U^n\pi(x) - U^n\pi(y)| \le C\lambda^r,$$

which can be made arbitrarily small as $r \to \infty$. Hence, if

$$\lim U^n\pi(y) = a$$

exists for a single $y$, then

$$\lim U^n\pi(x) = a$$

for every $x$. Since a subsequence can be found so that

$$\lim_{i\to\infty} U^{n_i}\pi(x) = a$$

for one $x$ and hence for all $x$, an argument used in the close of the proof of Theorem 25
shows that

$$\lim_{n\to\infty} U^n\pi(x) = a.$$

The lemma easily implies that the convergence is uniform. Using the fact that
$\|U^n\| = 1$, we can sum up the conclusions for this nonlinear model as follows:

Theorem 35. If $\pi(t)$ is continuous, then $\lim_{n\to\infty} U^n\pi$ exists uniformly converging
to a constant limit.

Theorem 36. If $\phi(t)$ belongs to $C^m$, and $\pi(t)$ is in $C^m$, then

$$\lim_{n\to\infty} (U^n\pi)^{(m)}(t) = 0$$

with convergence uniform in $t$.

Theorem 37. For any distribution $F$, $T^nF$ converges to a distribution $F_{\sigma,\alpha}$
independent of $F$ with $TF_{\sigma,\alpha} = F_{\sigma,\alpha}$ and $F_{\sigma,\alpha}$ continuous with respect to $\sigma$, $\alpha$.

This last theorem follows on account of the conjugate relationship of $T$ and $U$.
Finally, we note that the method used in this section can be employed to analyze
the random walks with any number of impulses

$$F_i x = (1 - \alpha_i) w_i + \alpha_i x.$$

7. In the present section we investigate the nature of the limiting distribution
obtained in the various models. In the case where the boundaries were absorbing
states as in §§ 1 and 5, we find that the limiting distribution is discrete and concentrates
at the two ends 0 and 1. The weight at 1 depends on the starting distribution $F$ and
is given by

$$\int_0^1 \phi_{\sigma,\alpha}(x)\,dF(x),$$

where $\phi_{\sigma,\alpha}$ is the unique continuous fixed point of $U\phi = \phi$ with $\phi(0) = 0$ and $\phi(1) = 1$.
Many properties of $\phi_{\sigma,\alpha}$ are developed in those sections. In all the other types the
ergodic property was seen to hold and the limiting distribution was independent of the
initial distribution. Let us deal with the following general type. The random walk is
given by $x \to F_1 x = \sigma x$ with probability $1 - \phi(x)$, and $x \to F_2 x = 1 - \alpha + \alpha x$ with
probability $\phi(x)$, where $1 - \delta \ge \phi(x) \ge \delta > 0$. The relevant operators are given by
equations (20) and (21). Let the limiting distribution be denoted by $F_{\sigma,\alpha}$.

We now distinguish two cases: (a) $\sigma \ge 1 - \alpha$ and (b) $\sigma < 1 - \alpha$. Let us
examine case (b) first. We note that the union of the image sets $F_1[0, 1] + F_2[0, 1]$ of
$F_1$ and $F_2$ applied to the unit interval does not overlap with the open subinterval
$(\sigma, 1 - \alpha)$. Any two applications of $F_1$ and $F_2$ leave empty the two additional open
intervals $(\sigma^2, (1 - \alpha)\sigma)$ and $(1 - \alpha(1 - \sigma), 1 - \alpha^2)$. Proceeding in this way, we find that
the limit of the total set covered by $n$ applications of $F_i$ ($i = 1, 2$) in any arrangement is
a Cantor set $C$. It is easily seen that $F_{\sigma,\alpha}$ must concentrate its full probability on this
set $C$.
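
The nested construction in case (b) can be traced numerically. The sketch below is illustrative only and not from the paper: the function names are hypothetical, and the parameter values $\sigma = 0.3$, $\alpha = 0.4$ are chosen simply so that $\sigma < 1 - \alpha$. It lists the intervals covered after $n$ applications of the two impulses and their total length, which shrinks like $(\sigma + \alpha)^n$ when the images do not overlap.

```python
def image_intervals(sigma, alpha, n):
    """Intervals covered by all n-fold compositions of F1 x = sigma*x and
    F2 x = 1 - alpha + alpha*x applied to the unit interval [0, 1]."""
    intervals = [(0.0, 1.0)]
    for _ in range(n):
        nxt = []
        for lo, hi in intervals:
            nxt.append((sigma * lo, sigma * hi))                          # image under F1
            nxt.append((1 - alpha + alpha * lo, 1 - alpha + alpha * hi))  # image under F2
        intervals = sorted(nxt)
    return intervals

def total_length(intervals):
    """Sum of the lengths of the given intervals."""
    return sum(hi - lo for lo, hi in intervals)

sigma, alpha = 0.3, 0.4  # illustrative values with sigma < 1 - alpha
for n in (1, 2, 6):
    ivs = image_intervals(sigma, alpha, n)
    print(n, len(ivs), total_length(ivs), (sigma + alpha) ** n)
```

With these values the two-step images are $[0, 0.09]$, $[0.18, 0.3]$, $[0.6, 0.72]$ and $[0.84, 1]$, so the newly uncovered gaps are $(0.09, 0.18)$ and $(0.72, 0.84)$.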

Now let

$$\pi_{t_0}(x) = \begin{cases} 1, & \text{if } x = t_0, \\ 0, & \text{if } x \ne t_0. \end{cases}$$

We show that $U^n\pi_{t_0}(x)$ converges uniformly to zero. Note that $U\pi_{t_0}(t)$ is zero for every
$t$ except at most one value of $t$; namely, $F_1^{-1}t_0$ or $F_2^{-1}t_0$. Of course, if $\sigma < t_0 < 1 - \alpha$,
then neither inverse exists for that $t_0$; and otherwise only one exists and

$$|U\pi_{t_0}| \le \max_x\,[\phi(x),\, 1 - \phi(x)] \le 1 - \delta.$$

Similarly, $|U^n\pi_{t_0}| \le (1 - \delta)^n$, from which the assertion follows. We now observe that

Consequently, the probability of $F_{\sigma,\alpha}$ at $t_0$ is zero for any $t_0$ with $0 \le t_0 \le 1$. Summing
up, we have established:

Theorem 38. If $\sigma < 1 - \alpha$, then the limiting distribution $F_{\sigma,\alpha}$ is a singular
distribution (probability zero at every point) spread on a Cantor-like set.

We now turn to examine case (a) where $\sigma \ge 1 - \alpha$. We note first that at least
one of the two mappings $F_1^{-1}$ or $F_2^{-1}$ is defined for every $x$ in the unit interval. Let $\pi(t)$
denote any continuous positive function defined on the unit interval so that $\pi(t) \ge \eta > 0$
for some subinterval $t_0 - h \le t \le t_0 + h$ ($h > 0$). Since at least $F_1^{-1}$ or $F_2^{-1}$ exists
at $t_0$ (say $F_1^{-1}$), we obtain $F_1^{-1}t_0 = t_1$. We construct $t_2$ from $t_1$ in the same way and
continue this for $n$ steps, obtaining $t_n = F^{-n}t_0$, where $F^{-n}$ denotes a specific order of
application of $F_1^{-1}$ or $F_2^{-1}$ a total of $n$ times. Let $F^n$ denote the reverse order of the
operators obtained by passing from $t_0$ to $t_n$. We note that

$$|F^n x - F^n y| \le \lambda^n |x - y| \le \lambda^n,$$

where $\lambda < 1$. Choose $n$ so large that $\lambda^n < h$; then for every $x$ we get that

Consequently, as

$$1 > 1 - \delta \ge \phi(x) \ge \delta > 0,$$

$U^n\pi$ is positive for all $x$ since $F^{-n}[t_0 - h, t_0 + h]$ covers the entire unit interval and
$\pi(t) \ge \eta > 0$ on this initial interval which is spread out by the term in $U^n$ involving
$F^n$. We have thus shown:

Theorem 39. If $\sigma \ge 1 - \alpha$, the operator $U$ is strictly positive; that is, for each
positive continuous function $\pi(t)$ there exists an $n$ depending upon $\pi$ so that $U^n\pi$ is strictly
positive.

Now let $\pi_{t_0}(t)$ be defined as before. Again we establish that $U^n\pi_{t_0}$ converges
uniformly to zero. To this end we observe that $U\pi_{t_0}$ has at most two possible values
at $F_1^{-1}t_0$ and $F_2^{-1}t_0$ given by $1 - \phi(F_1^{-1}t_0)$ and $\phi(F_2^{-1}t_0)$, respectively, while $U\pi_{t_0} = 0$
elsewhere. Also, $U^2\pi_{t_0}$ has at most four possible values and the maximum value that
could be achieved for $U^2\pi_{t_0}$ is

$$\max\bigl\{[1 - \phi(F_1^{-1}t_0)][1 - \phi(F_1^{-2}t_0)],\;\; \phi(F_2^{-1}t_0)\,\phi(F_2^{-2}t_0),$$
$$[1 - \phi(F_1^{-1}t_0)]\,\phi(F_2^{-1}F_1^{-1}t_0) + \phi(F_2^{-1}t_0)\,[1 - \phi(F_1^{-1}F_2^{-1}t_0)]\bigr\}.$$

To secure a bound for the maximum of $U^n\pi_{t_0}$, let us consider the same
repeated-experiment model set up in the previous section. The conditional probabilities of
success $p_n$ at the $n$th trial satisfy the uniform inequalities $1 > 1 - \eta \ge p_n \ge \eta > 0$,
where success in this case is taken to be an application of the impulse $F_1$ to the particle.
It is readily seen by standard inequalities that the probability of securing $k$ ($k \le n$)
successes converges uniformly to zero as $n \to \infty$. Moreover, it follows directly that
$\max_k$(probability of $k$ successes) is a bound for $U^n\pi_{t_0}$, and hence $U^n\pi_{t_0} \to 0$. We
deduce as before that $F_{\sigma,\alpha}$ has probability zero for every $t$. Thus the cumulative
distribution of $F$ is continuous. Let $F = F_1 + F_2$, where $F_1$ is absolutely continuous
and $F_2$ is singular. Observing that the transition operator transforms absolutely
continuous measures into absolutely continuous measures and singular measures into
singular measures, we find that $TF_1 = F_1$ and $TF_2 = F_2$. However, as the fixed
distribution is unique, we deduce that either $F_1$ or $F_2$ vanishes.

Theorem 40. If $\sigma \ge 1 - \alpha$, then the unique distribution $F_{\sigma,\alpha}$ is either absolutely
continuous or singular. Furthermore, $F_{\sigma,\alpha}$ has positive measure in every open interval.

Proof. We have demonstrated all the conclusions of the theorem but the last.
Let $\pi(t)$ denote a continuous function bounded by 1, and zero outside an open interval
$I$, and 1 on a closed subinterval $I'$ of $I$. By virtue of Theorem 39 there exists an $n$ such
that $U^n\pi \ge \delta > 0$ for all $t$. We note that

But

$$\int_I dF_{\sigma,\alpha} \ge (\pi, F_{\sigma,\alpha}) \ge \delta > 0,$$

and the proof of the theorem is complete.

We close with the conjecture that when $\sigma \ge 1 - \alpha$, then $F_{\sigma,\alpha}$ is always
absolutely continuous. An example where this is the case is furnished by $\phi(x) = 1/2$,
$\sigma = 1/2 = 1 - \alpha$, where $F_{\sigma,\alpha}(x) = x$.
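
The closing example can be checked directly from the transition operator (20). The following sketch is not from the paper: it assumes a constant $\phi$ and a continuous distribution function $F$ with $F(0) = 0$, so that the two integrals in (20) reduce to values of $F$, and the helper names are hypothetical.

```python
def clamp_cdf(F):
    """Extend a distribution function on [0, 1] by 0 to the left and 1 to the right."""
    def G(u):
        if u <= 0.0:
            return 0.0
        if u >= 1.0:
            return 1.0
        return F(u)
    return G

def transition(F, sigma=0.5, alpha=0.5, phi=0.5):
    """Operator (20) for constant phi and a continuous F with F(0) = 0."""
    G = clamp_cdf(F)
    return lambda x: (1 - phi) * G(x / sigma) + phi * G((x + alpha - 1) / alpha)

grid = [k / 200 for k in range(201)]

# The uniform distribution F(x) = x is reproduced by T (up to rounding).
uniform_dev = max(abs(transition(lambda x: x)(x) - x) for x in grid)

# Starting from F(x) = x**2, ten applications of T bring F close to uniform.
F = lambda x: x * x
for _ in range(10):
    F = transition(F)
iterated_dev = max(abs(F(x) - x) for x in grid)
print(uniform_dev, iterated_dev)
```

In this special case each application of $T$ halves the largest deviation from the uniform distribution, which is why a handful of iterations already suffices.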
SOME ASYMPTOTIC PROPERTIES OF LUCE'S
BETA LEARNING MODEL*


John  Lamperti  and  Patrick  Suppes 

Applied Mathematics and Statistics Laboratories
Stanford University

This paper studies asymptotic properties of Luce's beta model. Asymptotic
results are given for the two-operator and four-operator cases of contingent
and noncontingent reinforcement.

For  application  to  various  simple  learning  situations,  Luce  and  his 
collaborators,  Bush  and  Galanter,  [1,  7]  have  considered  a  learning  model  in 
which  the  changes  in  probability  of  response  from  trial  to  trial  are  not  linear 
functions  of  the  probability  of  response  on  the  preceding  trial.  Both  theoretical 
and  empirical  considerations  have  motivated  the  development  of  the  beta 
model.  Some  learning  theorists  like  Hull  and  Spence  believe  that  overt 
response  behavior  may  best  be  explained  in  terms  of  a  construct  like  that  of 
response  strength.  From  this  viewpoint  stochastic  learning  models  which 
postulate  a  linear  transformation  of  the  probability  of  response  from  one 
trial  to  the  next,  with  the  transformation  depending  on  the  reinforcing  event, 
are unsatisfactory in so far as they offer no more general psychological
justification of their postulates. From an empirical standpoint there is evidence
in  some  experiments,  particularly  certain  T-maze  experiments  with  rats, 
that  the  linear  stochastic  models  do  not  yield  good  predictions  of  actual 
behavior  [1,  7]. 

On the basis of some very simple postulates [7] on choice behavior,
Luce has shown that there exists a ratio scale $v$ over the set of responses with
the property that

$$p_{i,n} = \frac{v_n(i)}{\sum_j v_n(j)},$$

where $p_{i,n}$ is the probability of response $A_i$ on trial $n$, and $v_n(i)$ is the strength
of this response on trial $n$. Additional simple postulates lead to the result
that the $v_n(i)$ are transformed linearly from trial to trial, and this unobservable
stochastic process on response strengths then determines a stochastic process
*This  research  was  supported  in  part  by  the  Group  Psychology  Branch  of  the  Office 
of  Naval  Research  and  in  part  by  the  Rockefeller  Foundation. 

This  article  appeared  in  Psychometrika,  1960,  25,  233-241.    Reprinted  with  permission. 
in the response probabilities. Superficially, it would seem that the simplest
way to study the asymptotic behavior of the response probabilities (a
subject of interest in connection with nearly any learning data) would be
to determine the asymptotic behavior of the response strengths $v_n(i)$ and
then infer by means of the equation given above the behavior of the response
probabilities. This course is pursued rather far by Luce [7] and encounters
numerous mathematical difficulties. We have taken the alternative path of
studying directly the properties of the nonlinear transformations on the
response probabilities to obtain results on their asymptotic behavior.

We restrict ourselves to situations in which one of two responses, $A_1$
and $A_2$, is made. Let $p_n$ be the probability of response $A_1$ on trial $n$, and let
$E_1$ be the event of reinforcing response $A_1$, and $E_2$ the event of reinforcing
response $A_2$.

Luce's beta model is then characterized by the following transformations:
if $A_j$ and $E_k$ occurred on trial $n$, then for $j = 1, 2$ and $k = 1, 2$,

$$p_{n+1} = \frac{p_n}{p_n + \beta_{jk}(1 - p_n)}, \tag{1}$$

where $\beta_{jk} > 0$. Luce [7] gives a more general formulation. (Generally, we
want $\beta_{j1} < 1$ and $\beta_{j2} > 1$, to reflect the primary effects of reinforcement;
moreover, it is ordinarily assumed that $\beta_{11} \le \beta_{21} < \beta_{12} \le \beta_{22}$.) Throughout
this paper it is assumed that $0 \ne p_1 \ne 1$.

The most important fact about (1) is that the operators commute. For
example, suppose in the first $n$ trials there are $b_1$ occurrences of $A_1E_1$, $b_2$
occurrences of $A_2E_1$, $b_3$ occurrences of $A_1E_2$, $b_4$ occurrences of $A_2E_2$; then
it is easily shown that

$$p_{n+1} = \frac{p_1}{p_1 + \beta_{11}^{b_1}\beta_{21}^{b_2}\beta_{12}^{b_3}\beta_{22}^{b_4}(1 - p_1)}. \tag{2}$$
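
The commutativity claim and the closed form (2) are easy to confirm numerically. The sketch below is illustrative and not part of the paper: the four parameter values, the counts, and the starting probability 0.3 are arbitrary assumptions, and the function names are hypothetical.

```python
import itertools

def beta_step(p, beta):
    """One application of (1): p -> p / (p + beta * (1 - p))."""
    return p / (p + beta * (1.0 - p))

def closed_form(p1, betas, counts):
    """Formula (2): a single step with the product of the beta_jk ** b_i."""
    prod = 1.0
    for beta, b in zip(betas, counts):
        prod *= beta ** b
    return p1 / (p1 + prod * (1.0 - p1))

betas = (0.8, 0.9, 1.2, 1.5)   # illustrative beta_11, beta_21, beta_12, beta_22
counts = (2, 1, 1, 2)          # illustrative b_1, ..., b_4
sequence = [beta for beta, b in zip(betas, counts) for _ in range(b)]

# Apply the same multiset of operators in every distinct order.
results = []
for order in set(itertools.permutations(sequence)):
    p = 0.3
    for beta in order:
        p = beta_step(p, beta)
    results.append(p)

spread = max(results) - min(results)
print(spread, results[0], closed_form(0.3, betas, counts))
```

The reason the order does not matter is that each operator simply multiplies the odds ratio $(1 - p)/p$ by its own $\beta_{jk}$.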

The  aim  of  the  present  paper  is  to  study  asymptotic  properties  of  the 
beta  model  for  certain  standard  probabilistic  schedules  of  reinforcement. 
The  methods  of  attack  used  by  Karlin  [4]  and  by  Lamperti  and  Suppes  [6] 
for  linear  learning  models  do  not  directly  apply  to  the  nonlinear  beta  model. 

The basis of our approach is to change the state space (the probability
$p_n$ is the state) from the unit interval to the whole real line in such a way
that the transformations (1) become simply translations. The noncontingent
case (the next section) then reduces to sums of independent random variables;
the contingent cases can also be studied by "comparing" the resulting random
walks with the case of sums of random variables. The probabilistic tool for
this is developed and applied in later sections. The general conclusion to be
drawn from our results is that for all but one case of noncontingent
reinforcement individual response probabilities are ultimately either zero or one,
which is in marked contrast to corresponding results for linear learning
models. Absorption at zero or one also occurs for many, but not all, cases of
contingent reinforcement.

Noncontingent  Reinforcement  with  Two  Operators 

If the probability of a reinforcement is independent of response and
trial number, we have what is called simple noncontingent reinforcement.
Let $\pi$ be the probability of an $E_1$ reinforcement, and for simplicity let

$$\beta_{11} = \beta_{21} = \beta, \qquad \beta_{12} = \beta_{22} = \gamma, \qquad 0 < \beta < 1, \qquad \gamma > 1. \tag{3}$$

We seek an expression for the asymptotic probability distribution of response
probabilities in terms of the numbers $\pi$, $\beta$, and $\gamma$.

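The random variable $\eta_n$ is defined recursively as follows:

$$\eta_1 = \begin{cases} \beta & \text{with prob } \pi, \\ \gamma & \text{with prob } (1 - \pi); \end{cases}
\qquad
\eta_{n+1} = \begin{cases} \eta_n\beta & \text{with prob } \pi, \\ \eta_n\gamma & \text{with prob } (1 - \pi). \end{cases}$$

The random variable $X_n$ is defined as follows:

$$X_n = \log \eta_n.$$

Then

$$X_{n+1} = \begin{cases} X_n + \log\beta & \text{with prob } \pi, \\ X_n + \log\gamma & \text{with prob } (1 - \pi). \end{cases} \tag{4}$$

It is clear from (4) and what has preceded that $X_n$ is the sum of $n$ independent
identically distributed random variables $Y_i$ defined by

$$Y_i = \begin{cases} \log\beta & \text{with prob } \pi, \\ \log\gamma & \text{with prob } (1 - \pi). \end{cases}$$

By the strong law of large numbers, with probability one as $n \to \infty$

$$X_n \to \infty \ \text{ if } \ \pi\log\beta + (1 - \pi)\log\gamma > 0, \qquad
X_n \to -\infty \ \text{ if } \ \pi\log\beta + (1 - \pi)\log\gamma < 0. \tag{5}$$

Define now for any real number $x$

$$F_x(p) = \frac{p}{p + e^x(1 - p)}. \tag{6}$$

Then $p_{n+1} = F_{X_n}(p_1)$ for the sequence of reinforcements $\eta_n$, where $X_n = \log\eta_n$.
These results are utilized to prove the following theorem.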

Theorem 1. Let $c = \pi\log\beta + (1 - \pi)\log\gamma$. Then with probability one

$$\lim_{n\to\infty} p_n = \begin{cases} 0 & \text{if } c > 0, \\ 1 & \text{if } c < 0. \end{cases}$$

If $c = 0$, then $p_n$ oscillates between 0 and 1, so that with probability one

$$\limsup p_n = 1, \qquad \liminf p_n = 0.$$

Despite this oscillation, there is a limiting distribution for $p_n$; it is concentrated
at 0 and 1 with equal probabilities $\tfrac{1}{2}$.

Proof. The results for $c > 0$ and $c < 0$ follow immediately from (5),
(6), and the remark following. In case $c = 0$, note that $E(Y_i) = 0$. It is
known [2] that the sums $X_n$ are then recurrent; that is, they repeatedly
take on values arbitrarily close to any possible value. In particular, $X_n$ takes
on repeatedly arbitrarily large and arbitrarily small values (with probability
one), which upon recalling (6) proves the second statement. The third
statement is a consequence of the central limit theorem, which implies that for
any $A$, $\Pr(X_n > A)$ and $\Pr(X_n < -A)$ both converge to one-half as $n$
increases. Again the assertion of the theorem follows from this fact and (6).
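
The dichotomy in Theorem 1 can be illustrated by simulation. The sketch below is not from the paper: it assumes the logarithmic change of variable described in the text, written here as $p_{n+1} = 1/(1 + e^{X_n}(1 - p_1)/p_1)$, and the function names, parameter values, seed, and overflow guard are illustrative choices.

```python
import math
import random

def drift(pi, beta, gamma):
    """The constant c = pi*log(beta) + (1 - pi)*log(gamma) of Theorem 1."""
    return pi * math.log(beta) + (1 - pi) * math.log(gamma)

def simulate_p(p1, pi, beta, gamma, n, rng):
    """Response probability after n noncontingent reinforcements.

    X accumulates log(beta) with probability pi and log(gamma) otherwise.
    """
    x = 0.0
    for _ in range(n):
        x += math.log(beta) if rng.random() < pi else math.log(gamma)
    z = x + math.log((1 - p1) / p1)
    if z > 700.0:          # guard against overflow in exp
        return 0.0
    return 1.0 / (1.0 + math.exp(z))

rng = random.Random(0)
for pi in (0.5, 0.2):
    c = drift(pi, 0.5, 1.5)
    print(pi, c, simulate_p(0.5, pi, 0.5, 1.5, 2000, rng))
```

With $\beta = 0.5$ and $\gamma = 1.5$, the choice $\pi = 0.5$ gives $c < 0$ and a final probability near one, while $\pi = 0.2$ gives $c > 0$ and a final probability near zero.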

Two  Theorems  on  Random  Walks 

The results of this section are special cases of those in [5]. However,
the present approach has the advantages of simplicity and directness.

We have seen that the two-operator, noncontingent beta model gives
rise to a Markov process on the real line such that from $x$ the "moving
particle" goes to $x + a$ or $x - b$ with (constant) probabilities $\varphi$ and $1 - \varphi$.
The contingent case leads to a similar process, except that the transition
probabilities become functions of $x$. The four-operator model gives rise to a
process with four possible transitions, from $x$ to $x + a_i$, say, $i = 1, 2, 3, 4$.
In this section some simple results on processes of these sorts will be obtained,
in preparation for the study of the more general cases of the beta model. In
the interest of clarity, only the two-operator case will be treated in full; the
more general case can be handled in a similar way, but the details are
cumbersome. Our approach was suggested by the work of Hodges and Rosenblatt [3].
Let $\{X_n\}$ be a real Markov process such that if $X_n = x$,

$$X_{n+1} = \begin{cases} x + a & \text{with prob } \varphi(x), \\ x - b & \text{with prob } [1 - \varphi(x)], \end{cases} \tag{9}$$

where $0 < a, b, \varphi(x), 1 - \varphi(x)$. Let $\{Y_n\}$ be another process of the same type
(and with the same $a$ and $b$) but with constants $\theta$ and $1 - \theta$ as the transition
probabilities in place of $\varphi(x)$ and $1 - \varphi(x)$.

Lemma. If for all $x \ge M$, one has $\varphi(x) \ge \theta$, and if $\Pr(Y_n \to +\infty) > 0$,
then $\Pr(X_n \to +\infty) > 0$. If, on the other hand, for $x \ge M$, $\varphi(x) \le \theta$ and if
$\Pr(Y_n \to +\infty) = 0$, then $\Pr(X_n \to +\infty) = 0$.

Proof. Let $\{\xi_n\}$ be a sequence of independent random variables, each
uniformly distributed on $[0, 1]$. The $\{X_n\}$ process will be referred to $\{\xi_n\}$
by letting

$$X_{n+1} = \begin{cases} X_n + a & \text{if } \xi_{n+1} \le \varphi(X_n), \\ X_n - b & \text{otherwise.} \end{cases} \tag{10}$$

This does lead to the transition law (9) as may easily be seen. The $\{Y_n\}$
process can be linked to $\{X_n\}$ by referring it after the manner of (10) to the
same sequence $\{\xi_n\}$, so that $Y_{n+1} = Y_n + a$ if and only if $\xi_{n+1} \le \theta$.

Choose $Y_0 \ge M$. Whatever the value of $X_0$, since $\varphi(x) > 0$ there is
positive probability that $X_m \ge Y_0$ for some $m$; therefore assume $X_0 \ge Y_0$.
We now assert that for those sequences $\{Y_n\}$ with the property that $Y_n \ge M$
for all $n$, the inequality $X_n \ge Y_n$ is also valid for all $n$. This follows from our
construction "linking" the processes, and the assumption that $\varphi(x) \ge \theta$
for $x \ge M$; the transition $X_{n+1} = X_n - b$ and $Y_{n+1} = Y_n + a$ is impossible,
so $X_n - Y_n$ can only increase.

To complete the proof, note that since $\Pr(Y_n \to +\infty)$ is positive, so is
$\Pr(Y_n \to +\infty,\ Y_n \ge M \text{ for all } n)$. But the event $Y_n \to +\infty$, $Y_n \ge M$ for all
$n$ may be considered as a set $S$ in the sample space of the sequence $\{\xi_n\}$;
$S$ is a set of positive probability, and is contained in the set $X_n \to \infty$ since
on $S$, $X_n \ge Y_n$ and $Y_n \to \infty$. Hence $\Pr(X_n \to +\infty) > 0$. The second part
of the lemma is proved in a similar way, using the same construction linking
$\{X_n\}$ and $\{Y_n\}$.
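
The linking construction in this proof can be sketched in a few lines. The example below is an illustration under simplifying assumptions, not the authors' construction in full: the transition function `phi` is a hypothetical choice that satisfies $\varphi(x) \ge \theta$ for every $x$ (so the threshold $M$ plays no role), and integer step sizes are used so that the comparison is exact.

```python
import random

def phi(x, theta=0.4):
    """Hypothetical transition probability with theta <= phi(x) <= theta + 0.3."""
    return theta + 0.3 / (1.0 + abs(x))

def coupled_walks(x0, y0, a, b, theta, n, rng):
    """Drive both walks with the same uniform variables, as in (10)."""
    xs, ys = [x0], [y0]
    for _ in range(n):
        xi = rng.random()
        # X moves up when xi <= phi(X_n); Y moves up when xi <= theta.
        xs.append(xs[-1] + a if xi <= phi(xs[-1], theta) else xs[-1] - b)
        ys.append(ys[-1] + a if xi <= theta else ys[-1] - b)
    return xs, ys

xs, ys = coupled_walks(0, 0, 2, 1, 0.4, 1000, random.Random(1))
gaps = [x - y for x, y in zip(xs, ys)]
print(min(gaps), gaps[-1])
```

Whenever the shared uniform variable sends $Y$ up it also sends $X$ up, so the gap $X_n - Y_n$ either stays the same or grows by $a + b$ at each step.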

Theorem  2.  Let  b/{a  +  6)  =  c,  and  suppose  that 

(11)  lim  <p{x)  =  a     and     lim  ^(.r)  =  /3 

exist.  Then  if  a  <  c  and  fi  >  c, 

(12)  Pr(limsupX„  =  +  oo ,  lim  inf  X„  =  —  oo)  =1  ({X„|ts  recurrent), 
while  if  a  <  (>)  c  and  jS  <  (>)c,  then 

(13)  Pr(X„-^-a>  (4-co))  =  1. 

Finally,  if  a  >  c  and  ^  <  c, 

(14)  Pr  (X„  -^  4-  oo)  =  5,        Pr  (Z„  -^  -  oo)  =  1  -  5 
for  some  0  <  6  <  1 . 
Proof. Suppose, for instance, that $\alpha < c$. Let $\{Y_n\}$ (as in the lemma)
be a process with constant transition probabilities $\theta$ and $1 - \theta$ where
$\alpha < \theta < c$. The $\{Y_n\}$ process may be regarded as sums of random variables

(15)  $Y_n = Y_0 + \sum_{i=1}^{n} Z_i$,  where  $\Pr(Z_i = a) = \theta$  and  $\Pr(Z_i = -b) = 1 - \theta$.

But $E(Z_i) = a\theta - b(1 - \theta) < 0$, since $\theta < c$; this implies that
$\Pr(Y_n \to -\infty) = 1$ by the law of large numbers. From the lemma,
$\Pr(X_n \to +\infty) = 0$.

Similarly, if $\alpha > c$ it follows that $\Pr(X_n \to +\infty) > 0$. Since the lemma
also holds for convergence to $-\infty$ (with $\varphi$ and $\theta$ replaced by $1 - \varphi$ and
$1 - \theta$), we obtain in the same way that $\beta < c$ makes $\Pr(X_n \to -\infty) > 0$,
while if $\beta > c$ this probability is zero.

Consider the case when $\alpha < c$ and $\beta < c$; there is then positive probability
of absorption at $-\infty$, but not at $+\infty$. It is not hard to see that $X_n \to -\infty$
with probability one; the idea is roughly as follows. Since $X_n \not\to +\infty$, we
have $X_n < N$ infinitely often with probability arbitrarily close to 1 for some
$N$. Now the probability that from or to the left of $N$ the random walk goes and
remains to the left of $N - M$ must be positive since $\Pr(X_n \to -\infty) > 0$.
But in an infinite sequence of not necessarily independent trials, an event
whose probability on each trial is bounded away from zero is certain to
occur. Hence for any $M$, the random walk will eventually become and remain
to the left of $N - M$, and therefore $X_n \to -\infty$ with probability arbitrarily
close to 1 (and so equal to one). The other cases are similar; one can think
of $\alpha > c$ or $\alpha < c$ as the conditions under which $+\infty$ is an absorbing or
reflecting barrier, etc., and the process behaves accordingly.
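
The case analysis of Theorem 2, together with the drift computation used in its proof, can be sketched in a few lines of Python. This is only an illustration: the function names `critical_value`, `drift`, and `classify` and the returned labels are ours, not the paper's, and the boundary cases $\alpha = c$ or $\beta = c$ are reported as "not covered" because the theorem says nothing about them.

```python
def critical_value(a, b):
    """c = b / (a + b) for a walk that steps up by a > 0 or down by b > 0."""
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    return b / (a + b)


def drift(theta, a, b):
    """Mean step E(Z) of the comparison walk: +a w.p. theta, -b w.p. 1 - theta."""
    return a * theta - b * (1 - theta)


def classify(alpha, beta, a, b):
    """Regime of Theorem 2 given the limits alpha (at +inf) and beta (at -inf)."""
    c = critical_value(a, b)
    if alpha == c or beta == c:
        return "not covered"        # boundary case, outside Theorem 2
    if alpha < c and beta > c:
        return "recurrent"          # (12)
    if alpha < c and beta < c:
        return "to -infinity"       # (13)
    if alpha > c and beta > c:
        return "to +infinity"       # (13), parenthetical version
    return "split"                  # (14): alpha > c and beta < c
```

The sign of `drift(theta, a, b)` is negative exactly when `theta < critical_value(a, b)`, which is the step of the proof that invokes the law of large numbers.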

The generalization to the four-operator case will now be described. Let
$\{X_n\}$ be a real Markov process such that if $X_n = x$, then

(17)  $X_{n+1} = x + a_i$  with prob $\varphi_i(x)$,

where $a_1, a_2 > 0 > a_3, a_4$ and $\varphi_i(x) > 0$. Suppose

(18)  $\lim_{x \to +\infty} \varphi_i(x) = \alpha_i$  and  $\lim_{x \to -\infty} \varphi_i(x) = \beta_i$

exist, and let

$\mu_+ = \sum_{i=1}^{4} a_i \alpha_i$  and  $\mu_- = \sum_{i=1}^{4} a_i \beta_i$.

By methods entirely similar to those used above, but rather more involved,
it is possible to prove the following.

Theorem 3. For the process $\{X_n\}$ described above, if $\mu_+ < 0$ and $\mu_- > 0$
then (12) holds; if $\mu_+ < (>)\ 0$ and $\mu_- < (>)\ 0$ then (13) applies; while if $\mu_+ > 0$
and $\mu_- < 0$, (14) is valid.
Contingent Reinforcement with Two Operators

If the probability of reinforcement depends only on the immediately
preceding response (on the same trial), one has (simple) contingent
reinforcement. Let $\Pr(E_1 \mid A_1) = \pi_1$ and $\Pr(E_1 \mid A_2) = \pi_2$, and let the two operators
$\beta$ and $\gamma$ be specified as in (3). Using (6), define the random variable $X_n$
recursively. (Note that $\log\gamma$ appears first, since $\log\gamma > 0$ and $\log\beta < 0$,
in order most directly to apply Theorem 2.)


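(19)  $X_{n+1} = \begin{cases} X_n + \log\gamma & \text{with prob } F_{X_n}(p_1)(1 - \pi_1) + (1 - F_{X_n}(p_1))(1 - \pi_2) = \varphi(X_n), \\ X_n + \log\beta & \text{with prob } 1 - \varphi(X_n). \end{cases}$

Observe that

(20)  $\lim_{x \to +\infty} \varphi(x) = 1 - \pi_2$  and  $\lim_{x \to -\infty} \varphi(x) = 1 - \pi_1$.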

Combining  (20)  and  Theorem  2,  one  then  has  immediately  Theorem  4. 

Theorem 4. For the contingent case of the two-operator model, let
$c = -\log\beta / \log(\gamma/\beta)$. Then with probability one

(i)   if $1 - \pi_2 < c$ and $1 - \pi_1 > c$ then $\limsup_n p_n = 1$ and $\liminf_n p_n = 0$,
(ii)  if $1 - \pi_2 < c$ and $1 - \pi_1 < c$ then $p_\infty = 1$,
(iii) if $1 - \pi_2 > c$ and $1 - \pi_1 > c$ then $p_\infty = 0$.

Moreover,

(iv)  if $1 - \pi_2 > c$ and $1 - \pi_1 < c$ then for some $\delta$ with $0 < \delta < 1$
      $\Pr(p_n \to 1) = \delta$,  $\Pr(p_n \to 0) = 1 - \delta$.
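
The identification with Theorem 2 that yields Theorem 4 can be written out explicitly. The lines below are our own expansion of that step, not a display from the original; to avoid a clash with the operator $\beta$, the two limits of Theorem 2 are written $\alpha_\varphi$ and $\beta_\varphi$, a notation introduced only for this note.

```latex
\begin{align*}
a &= \log\gamma > 0, \qquad b = -\log\beta > 0, \\
c &= \frac{b}{a + b}
   = \frac{-\log\beta}{\log\gamma - \log\beta}
   = \frac{-\log\beta}{\log(\gamma/\beta)}, \\
\alpha_\varphi &= \lim_{x \to +\infty} \varphi(x) = 1 - \pi_2, \qquad
\beta_\varphi = \lim_{x \to -\infty} \varphi(x) = 1 - \pi_1 .
\end{align*}
```

With these substitutions, cases (i) and (iv) correspond to (12) and (14), while cases (ii) and (iii) are the two versions of (13): comparing hypotheses shows that $X_n \to -\infty$ corresponds to $p_\infty = 1$ and $X_n \to +\infty$ to $p_\infty = 0$.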

The intuitive character of the distinction between the results expressed
in (i) and (iv) of this theorem should be clear. If $1 - \pi_2 < c$ and $1 - \pi_1 > c$,
then probability zero of an $A_1$ response and probability one of an $A_1$ response
are both reflecting barriers, whereas if $1 - \pi_2 > c$ and $1 - \pi_1 < c$, they are
both absorbing barriers.

It is also to be noticed that except when $1 - \pi_1 = c$ or $1 - \pi_2 = c$,
Theorem 4 covers all values of $\beta$, $\gamma$, $\pi_1$, and $\pi_2$ for the contingent case. It can
be shown [5] by deeper methods that if $1 - \pi_1 = c$ (or $1 - \pi_2 = c$) then
probability one (respectively zero) of an $A_1$ response is again a reflecting
barrier. These results agree with those given by Luce ([7], p. 124) and in
addition settle most of the open questions in his Table 6. Detailed comparison
is tedious because his classification of cases differs considerably from ours as
given in the above theorem.
Contingent Reinforcement with Four Operators

We want finally to apply Theorem 3 to the contingent case of the general
four-operator model formulated in (1). Analogous to (19),


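(21)  $X_{n+1} = \begin{cases} X_n + \log\beta_{22} & \text{with prob } (1 - \pi_2)(1 - F_{X_n}(p_1)) = \varphi_{22}(X_n), \\ X_n + \log\beta_{12} & \text{with prob } (1 - \pi_1) F_{X_n}(p_1) = \varphi_{12}(X_n), \\ X_n + \log\beta_{21} & \text{with prob } \pi_2 (1 - F_{X_n}(p_1)) = \varphi_{21}(X_n), \\ X_n + \log\beta_{11} & \text{with prob } \pi_1 F_{X_n}(p_1) = \varphi_{11}(X_n). \end{cases}$

Also,

(22)  $\lim_{x \to +\infty} \varphi_{22}(x) = 1 - \pi_2$,    $\lim_{x \to -\infty} \varphi_{22}(x) = 0$,
      $\lim_{x \to +\infty} \varphi_{12}(x) = 0$,          $\lim_{x \to -\infty} \varphi_{12}(x) = 1 - \pi_1$,
      $\lim_{x \to +\infty} \varphi_{21}(x) = \pi_2$,      $\lim_{x \to -\infty} \varphi_{21}(x) = 0$,
      $\lim_{x \to +\infty} \varphi_{11}(x) = 0$,          $\lim_{x \to -\infty} \varphi_{11}(x) = \pi_1$.

Then

(23)  $\mu_+ = \sum_{j,k} \log\beta_{jk} \lim_{x \to +\infty} \varphi_{jk}(x) = \pi_2 \log\beta_{21} + (1 - \pi_2) \log\beta_{22}$,

and

(24)  $\mu_- = \sum_{j,k} \log\beta_{jk} \lim_{x \to -\infty} \varphi_{jk}(x) = \pi_1 \log\beta_{11} + (1 - \pi_1) \log\beta_{12}$.

To apply Theorem 3 one also assumes that $\beta_{22}, \beta_{12} > 1 > \beta_{21}, \beta_{11} > 0$.
On this assumption, and utilizing (23) and (24), we infer Theorem 5.

Theorem 5. For the contingent case of the four-operator model, with
probability one

(i)   if $\mu_+ < 0$ and $\mu_- > 0$ then $\limsup p_n = 1$ and $\liminf p_n = 0$,
(ii)  if $\mu_+ < 0$ and $\mu_- < 0$ then $p_\infty = 1$,
(iii) if $\mu_+ > 0$ and $\mu_- > 0$ then $p_\infty = 0$;

and if $\mu_+ > 0$ and $\mu_- < 0$, then for some $\delta$ with $0 < \delta < 1$

(iv)  $\Pr(p_n \to 1) = \delta$,  $\Pr(p_n \to 0) = 1 - \delta$.

Specialization of this theorem to cover the noncontingent case is immediate.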
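
As a rough numerical companion to Theorem 5, the quantities $\mu_+$ and $\mu_-$ of (23) and (24) can be computed directly from the four operators and the two reinforcement probabilities. The sketch below is ours: the dictionary layout `b[(j, k)]` for $\beta_{jk}$ and the function names are assumptions made for illustration, and parameter values with $\mu_+ = 0$ or $\mu_- = 0$ are reported as not covered.

```python
from math import log


def check_operators(b):
    """Theorem 3 needs beta_22, beta_12 > 1 > beta_21, beta_11 > 0."""
    ok = (b[(2, 2)] > 1 and b[(1, 2)] > 1
          and 0 < b[(2, 1)] < 1 and 0 < b[(1, 1)] < 1)
    if not ok:
        raise ValueError("need beta_22, beta_12 > 1 > beta_21, beta_11 > 0")


def mu_plus(b, pi2):
    """Equation (23): limiting mean step as x -> +infinity."""
    return pi2 * log(b[(2, 1)]) + (1 - pi2) * log(b[(2, 2)])


def mu_minus(b, pi1):
    """Equation (24): limiting mean step as x -> -infinity."""
    return pi1 * log(b[(1, 1)]) + (1 - pi1) * log(b[(1, 2)])


def theorem5_case(b, pi1, pi2):
    """Return 'i', 'ii', 'iii' or 'iv' as in Theorem 5, or 'not covered'."""
    check_operators(b)
    mp, mm = mu_plus(b, pi2), mu_minus(b, pi1)
    if mp == 0 or mm == 0:
        return "not covered"
    if mp < 0 and mm > 0:
        return "i"      # p_n oscillates between 0 and 1
    if mp < 0 and mm < 0:
        return "ii"     # p_n -> 1
    if mp > 0 and mm > 0:
        return "iii"    # p_n -> 0
    return "iv"         # p_n -> 1 or p_n -> 0, each with positive probability
```

For instance, with $\beta_{11} = \beta_{21} = 1/2$ and $\beta_{12} = \beta_{22} = 2$, the choice $\pi_1 = 0.8$, $\pi_2 = 0.2$ gives $\mu_+ > 0 > \mu_-$, which is case (iv).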



REFERENCES 

[1] Bush, R. R., Galanter, E., and Luce, R. D. Tests of the "beta model." In R. R. Bush
    and W. K. Estes (Eds.), Studies in mathematical learning theory. Stanford: Stanford
    Univ. Press, 1959. Ch. 18.
[2] Chung, K. L. and Fuchs, W. H. J. On the distribution of values of sums of random
    variables. Mem. Amer. Math. Soc., 1951, 6, 1-12.
[3] Hodges, J. L. and Rosenblatt, M. Recurrence time moments in random walks. Pac.
    J. Math., 1953, 3, 127-136.
[4] Karlin, S. Some random walks arising in learning models I. Pac. J. Math., 1953, 3,
    725-756.
[5] Lamperti, J. Criteria for the recurrence or transience of stochastic processes I. J.
    Math. Anal. Applications (in press).
[6] Lamperti, J. and Suppes, P. Chains of infinite order and their application to learning
    theory. Pac. J. Math., 1959, 9, 739-754.
[7] Luce, R. D. Individual choice behavior. New York: Wiley, 1959.

Manuscript  received  4/^7/59 

Revised  manuscript  received  11/10/59 


CHAINS  OF  INFINITE  ORDER  AND  THEIR 
APPLICATION  TO  LEARNING  THEORY 

John  Lamperti  and  Patrick  Suppes 

1. Introduction. The purpose of this paper is to study the asymptotic
behavior of a large class of stochastic processes which have been
used as models of learning experiments. We will do this by applying
a theory of so-called "chains of infinite order" or "chaînes à liaisons
complètes." Namely, we shall employ certain limit theorems for stochastic
processes whose transition probabilities depend on the entire past
history of the process, but only slightly on the remote past. Such theorems
were given by Doeblin and Fortet [3] in a form close to that we
employ; however, in order to accommodate certain cases of learning models
we found it necessary to relax their hypotheses somewhat. A self-contained
discussion of these and some additional results is the content of §2.

We should emphasize that this section is included to serve as preparation
for the theorems of §4, and it is original with us only in some
details and extensions. In addition to [3], papers by Harris [7] and
Karlin [8] contain very closely related results and arguments, but not
quite in the form we require.

The processes which we shall study with these tools are called "linear
learning models." From a psychological standpoint these models are
very simple. A subject is presented a series of trials, and on each
trial he makes a response, which consists of a choice from a finite set
of possible actions. This response is followed by a reinforcement (again
one of a finite number). The assumption of the model is that the subject's
response probabilities on the next trial are linear functions of the
probabilities on the present trial, where the form of the functions depends
upon which reinforcement has occurred. Many results about such
models may be found in Bush and Mosteller [2], Estes [4], and Estes and
Suppes [6]. We will also study here models constructed along similar
lines for experiments involving two or more subjects and a type of
interaction between them (see [6, Section 9] and Atkinson and Suppes [1]).
Precise definitions of these processes are given below in §3.

The references mentioned above do not, except in very special cases,
give a thorough treatment of asymptotic properties. We shall prove
that under general conditions linear learning models exhibit "ergodic"
behavior; that is, that after much time has passed these processes become
approximately stationary and the influence of the initial distributions
goes to zero. This is not the case for all models which have been
used in experimental work, but it seems as if ergodic behavior can be
proved by our method in almost all the cases in which one might expect
it. Our theorems to this effect, their proofs and some corollaries are
given in §4.

The major work so far on limiting behavior of learning models is that of
Karlin [8], who obtains detailed limit theorems for certain classes of
models. However, the results and even the techniques of Karlin's paper
do not apply to many cases of interest. His starting point is a representation
of the linear model as a Markov process whose states are the
response probabilities. Two typical situations when such a representation
is impractical arise (i) when the probabilities with which the reinforcement
is selected depend on two or more previous responses, and (ii)
in the many-person situations mentioned above. Both these situations
can (and will) be studied using infinite order chains, and ergodic behavior
established under mild restrictions. On the other hand, Karlin's work
treats interesting non-ergodic cases outside the scope of our approach. For
example, consider a T-maze experiment in which the subject (a rat, say)
is reinforced (rewarded) on each trial regardless of whether he goes left
or right. In the appropriate linear model, the probability of a left turn
eventually is either nearly 0 or nearly 1, and which it is depends upon
the rat's initial response probabilities. The model of this experiment
has been thoroughly studied in [8, Section 2], and these results have
been generalized by Kennedy [9].

In  conclusion  we  comment  that  both  more  detailed  results  and  other 
applications  seem  possible  using  the  ideas  of  "infinite  order  chains." 
We  hope  to  contribute  further  to  this  development  in  the  future. 

2. Chains of infinite order. In this section we present a theory
of non-Markov stochastic processes where the transition probabilities are
influenced only slightly by the remote past. The original convergence
theorems for this type of process are due to Doeblin and Fortet [3];
they are given here in a generalized form (Theorems 2.1 and 2.2). The
weaker hypotheses make the proof of Lemma 2.1 more complicated than
it is in [3], but the other proofs are not much affected. T. E. Harris
has also studied these chains; we shall not use his results but remark
that his paper [7] gives additional references and background on the
subject. Finally we point out that the restriction to a finite number of
states is not essential, and the theorems can be extended to the denumerable
case without much change of methods.

Let $I$ consist of the integers from 1 to $N$ (to represent the states
of the chain); we shall use the notation $x$ for a finite sequence $i_0, i_1, \cdots$
of integers from $I$. The subscript "$m$" on $x_m$ merely adds the specification
that the sequence has $m$ terms; the "sum" $x_m + x'$ will be the
combined sequence $i_0, \cdots, i_{m-1}, i'_0, i'_1, \cdots$. The starting point for the theory
will be a set of functions $p_i(x)$ defined for all $i \in I$ and all sequences $x$
(including the sequence $\phi$ of length zero) and having the properties

(2.1)  $p_i(x) \ge 0$,  $\sum_i p_i(x) = 1$.

The function $p_i(x)$ will be interpreted as the conditional probability that
a path function of the random process will go next to state $i$, having
just occupied state $i_0$, previously $i_1$, etc. With this interpretation in
mind we define inductively the "higher transition probabilities":

(2.2)  $p_i^{(n)}(x) = \sum_j p_j(x)\, p_i^{(n-1)}(j + x)$,

where of course $p_i^{(1)}(x) = p_i(x)$, the given function. It is easy to see that
these higher probabilities also satisfy condition (2.1). The functions
$p_i^{(n)}(x)$ are the analogues of the terms of the matrix $P^n$ for a Markov
chain with transition matrix $P$; the theorems we shall give generalize
the convergence properties of the matrices $P^n$.
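
The recursion (2.2) is easy to evaluate directly for small $n$. The following Python sketch is an illustration under our own conventions, not part of the original paper: states are numbered from 0 rather than 1, a history is a tuple with the most recent state first, and the two example kernels `markov_p` (an ordinary Markov chain) and `fading_p` (a two-state chain whose dependence on the past decays geometrically) are hypothetical. The running time grows like $N^n$, so this is meant only for checking small cases.

```python
def higher_transition(p, n_states, i, n, x=()):
    """p_i^(n)(x) computed from (2.2); p(i, x) is the one-step kernel."""
    if n == 1:
        return p(i, x)
    return sum(
        p(j, x) * higher_transition(p, n_states, i, n - 1, (j,) + x)
        for j in range(n_states)
    )


# Markov special case: only the most recent state x[0] matters.
P = [[0.9, 0.1],
     [0.4, 0.6]]


def markov_p(i, x):
    # The value 0.5 for an empty history is an arbitrary choice for this sketch.
    return P[x[0]][i] if x else 0.5


def fading_p(i, x):
    """Two states; Pr(next = 1 | x) = 0.2 + 0.6 * sum_k 2^-(k+1) * x_k."""
    q = 0.2 + 0.6 * sum(s * 0.5 ** (k + 1) for k, s in enumerate(x))
    return q if i == 1 else 1.0 - q
```

For `markov_p` the value `higher_transition(markov_p, 2, 0, 2, (0,))` reduces to the $(0, 0)$ entry of $P^2$, namely $0.9 \cdot 0.9 + 0.1 \cdot 0.4 = 0.85$, which is the sense in which $p_i^{(n)}(x)$ generalizes matrix powers.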

We shall first impose a positivity condition on the transition probabilities;
specifically we assume that for some state $j_0$, some positive
integer $n_0$, and some $\delta > 0$,

(2.3)  $p_{j_0}^{(n_0)}(x) > \delta$  for every $x$.

We also need to make precise the "slight" dependence of these probabilities
on the remote past; indeed, this is the crux of the whole theory.
Define

(2.4)  $\varepsilon_m = \sup |p_i(x + x') - p_i(x + x'')|$

where the sup is taken over all states $i$, all sequences $x'$ and $x''$, and
all sequences $x$ which contain the state $j_0$ at least $m$ times. We shall
use the postulate

(2.5)  $\sum_{m=0}^{\infty} \varepsilon_m < \infty$.

(In [3], $\varepsilon_m$ is defined in the same way except that the sup is taken over
all $x$ of length at least $m$. Since this results in larger $\varepsilon_m$'s, and since
it is also assumed there that $\sum \varepsilon_m < \infty$, our hypotheses are strictly
weaker.) Throughout this section, (2.3) and (2.5) will be assumed.

Lemma 2.1.

(2.6)  $\lim_{m \to \infty} \bigl[\sup |p_i^{(n)}(x + x') - p_i^{(n)}(x + x'')|\bigr] = 0$,

where the sup is the same as in (2.4) (i.e., $x$ contains $j_0$ at least $m$
times); the convergence is uniform in $n$.
Proof. We define quantities $\varepsilon_m^{(k)}$ by using $p_i^{(k)}$ instead of $p_i$ in (2.4);
then of course $\varepsilon_m^{(1)} = \varepsilon_m$, and the conclusion of the lemma is equivalent
to $\varepsilon_m^{(k)} \to 0$ uniformly in $k$ as $m \to \infty$. Now

$|p_i^{(k)}(x + x') - p_i^{(k)}(x + x'')|$
$= \bigl|\sum_j \{p_i^{(k-1)}(j + x + x')\, p_j(x + x') - p_i^{(k-1)}(j + x + x'')\, p_j(x + x'')\}\bigr|$
$\le \sum_j p_j(x + x')\, |p_i^{(k-1)}(j + x + x') - p_i^{(k-1)}(j + x + x'')|$
$\quad + \sum_j p_i^{(k-1)}(j + x + x'')\, |p_j(x + x') - p_j(x + x'')|$.

Suppose that $x$ contains $j_0$ $m$ times. Then the second term of the above
estimate is less than $N\varepsilon_m$. The absolute value in the first term is less
than $\varepsilon_m^{(k-1)}$, but if $j = j_0$ this can be improved to $\varepsilon_{m+1}^{(k-1)}$. Taking account
of (2.3) and assuming that $n_0 = 1$, we obtain the estimate

(2.7)  $\varepsilon_m^{(k)} \le N\varepsilon_m + \delta\, \varepsilon_{m+1}^{(k-1)} + (1 - \delta)\, \varepsilon_m^{(k-1)}$.

(In case $n_0 > 1$, the same idea can be carried out; the details are more
cumbersome and will not be given.)

Now (2.7) can be iterated to obtain an estimate of $\varepsilon_m^{(k)}$ in terms of
$\varepsilon_m$. After some computation the result is

$\varepsilon_m^{(k)} \le N\varepsilon_m \sum_{i=0}^{k-1} (1 - \delta)^i + N\varepsilon_{m+1}\, \delta \sum_{i=0}^{k-2} \binom{i+1}{1} (1 - \delta)^i$
$\quad + \cdots + N\varepsilon_{m+l}\, \delta^l \sum_{i=0}^{k-l-1} \binom{l+i}{l} (1 - \delta)^i + \cdots + N\delta^{k-1} \varepsilon_{m+k-1}$.

If the series are extended to infinity, the inequality remains true; calling
these (infinite) series $A_0, A_1, \cdots, A_{k-1}$ we have

$\varepsilon_m^{(k)} \le N \sum_{l=0}^{k-1} \varepsilon_{m+l}\, \delta^l A_l$.

But it can be shown without much difficulty that

$A_{l+1} - A_l = (1 - \delta) A_{l+1}$,

or $A_{l+1} = A_l / \delta$. Since $A_0 = \delta^{-1}$ we obtain $A_l = \delta^{-(l+1)}$, and hence

(2.8)  $\varepsilon_m^{(k)} \le \dfrac{N}{\delta} \sum_{l=m}^{\infty} \varepsilon_l$.

Recalling hypothesis (2.5), the uniform convergence of $\varepsilon_m^{(k)}$ follows from
(2.8).
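
The two computational claims in this proof can be checked numerically. The sketch below is our own illustration and rests on stated assumptions: `series_A` truncates the infinite series $A_l = \sum_{i \ge 0} \binom{l+i}{l} (1 - \delta)^i$ after a fixed number of terms, `iterate_27` applies (2.7) with equality (the worst case) starting from $\varepsilon_m^{(1)} = \varepsilon_m$, and `tail_bound` evaluates the bound $(N/\delta) \sum_{l \ge m} \varepsilon_l$ with the same truncation. None of these names come from the paper.

```python
from math import comb


def series_A(l, delta, terms=2000):
    """Truncated A_l = sum_i C(l + i, l) * (1 - delta)^i; exact value delta^-(l+1)."""
    return sum(comb(l + i, l) * (1.0 - delta) ** i for i in range(terms))


def iterate_27(eps, N, delta, k, m):
    """Apply (2.7) with equality k - 1 times, from e^(1)_m = eps(m); returns e^(k)_m."""
    level = [eps(m + j) for j in range(k)]      # e^(1) at m, m + 1, ..., m + k - 1
    for step in range(1, k):
        level = [
            N * eps(m + j) + delta * level[j + 1] + (1.0 - delta) * level[j]
            for j in range(k - step)
        ]
    return level[0]


def tail_bound(eps, N, delta, m, terms=2000):
    """Truncated value of (N / delta) * sum_{l >= m} eps(l)."""
    return (N / delta) * sum(eps(m + l) for l in range(terms))
```

With the summable choice $\varepsilon_m = 2^{-m}$, `iterate_27` stays below `tail_bound` for every number of iterations $k$, which is the uniformity in $k$ that the lemma asserts.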

Lemma 2.2.

(2.9)  $\lim_{n \to \infty} |p_i^{(n)}(x') - p_i^{(n)}(x'')| = 0$

and the convergence is uniform in $x'$ and $x''$.
Proof. For clarity we shall use probabilistic arguments, although a
purely analytic rephrasing is not hard. Consider two stochastic processes
operating independently with transition probabilities $p_i(x)$, one with the
sequence $x'$ for its past history up to time 0 and the other with $x''$.
In view of Lemma 2.1, for any $\varepsilon > 0$ there is an $m$ such that if the
two processes have occupied the same states for a period which includes
$j_0$ at least $m$ times and ends sometime before time $n$, then their probabilities
of being in state $i$ at time $n$ differ by at most $\varepsilon/2$. But it follows
from condition (2.3) that with probability one, there will sometime
be a period of length $m$ during which both processes remain in state $j_0$.
We can take $n$ large enough so that this simultaneous "run" of state
$j_0$ will occur before time $n$ with probability not less than $1 - \varepsilon/2$. For
this and all greater values of $n$, therefore, the two processes have probabilities
of occupying state $i$ at time $n$ which differ by at most $\varepsilon$, and
this proves (2.9). It is also easy to see from (2.3) and Lemma 2.1 that
$n$ can be chosen uniformly in $x'$ and $x''$.

With  this  much  preparation  we  shall  now  prove  the  first  theorem: 

Theorem 2.1. The quantities

(2.10)  $\lim_{n \to \infty} p_i^{(n)}(x) = \pi_i$

exist, are independent of $x$, and satisfy $\sum_i \pi_i = 1$; the convergence is
uniform in $x$.

Proof. Applying (2.2) repeatedly, we have

$p_i^{(n+m)}(x) = \sum_{x_m} p_{i_{m-1}}(x)\, p_{i_{m-2}}(i_{m-1} + x) \cdots p_{i_0}(i_1 + \cdots + i_{m-1} + x)\, p_i^{(n)}(x_m + x)$

where $x_m = i_0, i_1, \cdots, i_{m-1}$. Therefore

$|p_i^{(n+m)}(x) - p_i^{(n)}(x)|$
$\le \sum_{x_m} p_{i_{m-1}}(x) \cdots p_{i_0}(i_1 + \cdots + i_{m-1} + x)\, |p_i^{(n)}(x_m + x) - p_i^{(n)}(x)|$

and by Lemma 2.2, for any $\varepsilon$ there is an $n$ such that each term within
absolute value signs on the right is less than $\varepsilon$. Since the weights
$p_{i_{m-1}}(x) \cdots p_{i_0}(i_1 + \cdots + i_{m-1} + x)$ sum to one, we have

$|p_i^{(n+m)}(x) - p_i^{(n)}(x)| < \varepsilon$,

and so $p_i^{(n)}(x)$ has a (uniform in $x$) limit $\pi_i$. Since there are a finite
number of states,

$\sum_i \pi_i = \sum_i \lim_{n \to \infty} p_i^{(n)}(x) = \lim_{n \to \infty} \sum_i p_i^{(n)}(x) = 1$,

and this completes the proof.
Next we shall define joint probabilities. If $x_m$ is $i_0, i_1, \cdots, i_{m-1}$, let

(2.11)  $p_{x_m}(x') = p_{i_{m-1}}(x')\, p_{i_{m-2}}(i_{m-1} + x') \cdots p_{i_0}(i_1 + \cdots + i_{m-1} + x')$.

This is, of course, the probability of executing the sequence of states
$x_m$, starting with past history $x'$. We can define also the higher joint
probabilities:

(2.12)  $p_{x_m}^{(n)}(x') = \sum_j p_j(x')\, p_{x_m}^{(n-1)}(j + x')$.

Analogues of Lemmas 2.1 and 2.2 can be proved for these quantities by
the same arguments used already; in this way it is not difficult to prove

Theorem 2.2. The quantities

(2.13)  $\lim_{n \to \infty} p_{x_m}^{(n)}(x') = \pi_{x_m}$

exist, are independent of $x'$, and satisfy $\sum_{x_m} \pi_{x_m} = 1$; the convergence
is uniform in $x'$.

Remark. These two theorems imply the existence of a stationary
stochastic process with the $p_i(x)$ for transition probabilities. The idea
is that the quantities $\pi_{x_m}$ can be used to define a probability measure
on the "cylinder sets" in the space of infinite sequences of members
of $I$, and this measure can then be extended. This stationary process
need not concern us further here.

Finally we will prove convergence theorems for certain "moments"
which are useful in studying experimental data. The idea is that if we
have a stochastic process with the functions $p_i(x)$ for transition probabilities,
the probability $p_i(x_m)$ that the state at time $m$ is $i$ given the
past history $x_m$ is itself a random variable, and so it makes sense to
study $E(p_i^\nu(x_m))$. More formally, define

(2.14)  $\alpha_i^\nu(m, x) = \sum_{x_m} p_i^\nu(x_m + x)\, p_{x_m}(x)$

where $p_{x_m}(x)$ is defined by (2.11). Thus $\alpha_i^1(m, x)$ is the same as $p_i^{(m+1)}(x)$.
Theorem 2.1 states that $\lim \alpha_i^1(m, x) = \pi_i$ exists. We shall now prove

Theorem 2.3. The quantities

(2.15)  $\lim_{m \to \infty} \alpha_i^\nu(m, x) = \alpha_i^\nu$

exist for every positive integer $\nu$; convergence is uniform in $x$ and the
limit is independent of $x$.



Proof. We use a simple estimate to show that $\alpha_i^\nu(m, x)$ is a Cauchy
sequence:

$|\alpha_i^\nu(m + k + h, x) - \alpha_i^\nu(m + k, x)|$
$\le \sum_{x_{m+k+h}} |p_i^\nu(x_{m+k+h} + x) - p_i^\nu(x_m + x)|\, p_{x_{m+k+h}}(x)$
$\quad + \sum_{x_{m+k}} |p_i^\nu(x_{m+k} + x) - p_i^\nu(x_m + x)|\, p_{x_{m+k}}(x)$
$\quad + \bigl|\sum_{x_{m+k+h}} p_i^\nu(x_m + x)\, p_{x_{m+k+h}}(x) - \sum_{x_{m+k}} p_i^\nu(x_m + x)\, p_{x_{m+k}}(x)\bigr|$.

If $m$ is chosen large enough, the first two terms will be arbitrarily
small; this involves nothing more than the conditions (resulting from
(2.3) and (2.5)) that $\varepsilon_m \to 0$, and that a long sequence $x$ contains $j_0$ many
times with high probability. The last term may be rewritten by carrying
out the summation over all the indices except those in $x_m$; this yields

$\bigl|\sum_{x_m} p_i^\nu(x_m + x)\bigl(p_{x_m}^{(k+h)}(x) - p_{x_m}^{(k)}(x)\bigr)\bigr| \le \sum_{x_m} |p_{x_m}^{(k+h)}(x) - p_{x_m}^{(k)}(x)|$

which is small for all $h$ (and for all $x$) if $k$ is large enough, by Theorem
2.2. Thus if $n = m + k$, $|\alpha_i^\nu(n + h, x) - \alpha_i^\nu(n, x)|$ is small for all $h$, and
this proves that the limit (2.15) must exist; the limit is uniform in $x$
since $\alpha_i^\nu(m, x)$ is uniformly Cauchy. Another estimate along much the
same line can be made to show that for any $\varepsilon > 0$,

$|\alpha_i^\nu(m + k, x) - \alpha_i^\nu(m + k, x')| \le \varepsilon$

provided $m$ and $k$ are large. Since the limit of $\alpha_i^\nu(m + k, x)$ exists as
$m + k \to \infty$, we can conclude that the limit is the same for all $x$.

It is also desirable to consider some additional "cross" moments
involving $p_i(x_m)$ for several states at once; accordingly we define

(2.16)  $\alpha_{j_1 \cdots j_k}^{\nu_1 \cdots \nu_k}(m, x) = \sum_{x_m} p_{j_1}^{\nu_1}(x_m + x)\, p_{j_2}^{\nu_2}(x_m + x) \cdots p_{j_k}^{\nu_k}(x_m + x)\, p_{x_m}(x)$.

The following theorem is then a generalization of Theorem 2.3, which
treats the case $k = 1$:

Theorem 2.4. The quantities

(2.17)  $\lim_{m \to \infty} \alpha_{j_1 \cdots j_k}^{\nu_1 \cdots \nu_k}(m, x) = \alpha_{j_1 \cdots j_k}^{\nu_1 \cdots \nu_k}$

exist uniformly in $x$ for all non-negative integers $\nu_1 \cdots \nu_k$ and all
$j_1 \cdots j_k \in I$, and the limits are independent of $x$.

The argument used in proving Theorem 2.3 works in this case also
with only trivial changes, and need not be repeated. Finally we remark
that moments involving several values of $n$ can be considered, and it
can be shown that their limits exist also. This provides a generalization
of Theorem 2.2.

3. Definition of linear learning models. The models we consider
apply to an experimental situation which consists of a sequence of trials.
On each trial the subject of the experiment makes a response, which is
followed by a reinforcing event. Thus an experiment may be represented
by a sequence $(A_1, E_1, A_2, E_2, \cdots, A_n, E_n, \cdots)$ of random variables, where
the choice of letters follows conventions established in the literature:
the value of the random variable $A_n$ is a number $j$ representing the
actual response on trial $n$, and the value of $E_n$ is a number $k$ representing
the reinforcing event on trial $n$. The relevant data on each trial
may then be represented by an ordered pair $(j, k)$ of integers with
$1 \le j \le r$, and $0 \le k \le t$, that is, we envisage in general $r$ responses
and $t + 1$ reinforcing events. Any sequence of these pairs of integers
is a sequence of values of the random variables and thus represents a
possible experimental outcome. The general aim of the theory is to
predict the probability distribution of the response random variable when
a particular distribution, or class of distributions, is imposed on the
reinforcement random variable.

In dealing with the general linear model with $r$ responses and
$t + 1$ reinforcing events we are following the formulation in Chapter 1
of Bush and Mosteller [2], although our notation is somewhat different,
being closer to Estes [4] and Estes and Suppes [6].

The theory is formulated for the probability of a response on trial
$n + 1$ given the entire preceding sequence of responses and reinforcements.
For this preceding sequence we use the notation $x_n$. Thus

(It is convenient to write these sequences in this order, but note that
the numbering here is from past to present, not the reverse as in §2.)
Our single axiom is the following linearity assumption:

Axiom L. If $E_n = k$ and $P(x_n) > 0$ then

(3.1)  $P(A_{n+1} = j \mid x_n) = (1 - \theta_k)\, P(A_n = j \mid x_{n-1}) + \theta_k \lambda_{jk}$,

where $0 \le \theta_k, \lambda_{jk} \le 1$ and $\sum_j \lambda_{jk} = 1$.

We obtain the linear model studied intensively in [6] by setting:

(3.2)  $\theta_k = \theta$ for $k \ne 0$,
       $\theta_k = 0$ for $k = 0$,
       $\lambda_{jj} = 1$,
       $\lambda_{jk} = 0$ for $j \ne k$,
       $t = r$.

A linear model satisfying (3.2) we shall term an Estes Model, and for
such models (3.1) may be replaced by the simpler condition:

(3.3)  $P(A_{n+1} = j \mid x_n) = \begin{cases} (1 - \theta)\, P(A_n = j \mid x_{n-1}) + \theta & \text{if } E_n = j, \\ (1 - \theta)\, P(A_n = j \mid x_{n-1}) & \text{if } E_n = k,\ k \ne 0,\ k \ne j, \\ P(A_n = j \mid x_{n-1}) & \text{if } E_n = 0. \end{cases}$

Axiom L satisfies the combining classes condition of Bush and
Mosteller. Upon replacing $\theta$ by $1 - \alpha$ in (3.1) essentially their general
formulation of the linear model is obtained, although they do not explicitly
indicate dependence on the sequence $x_n$.
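
A single application of the linear operator is simple to write down in code. The following Python sketch is ours and makes some assumptions for illustration: the response probabilities on the current trial are held in a list `p` indexed by response $j = 1, \cdots, r$ (stored at positions $0, \cdots, r - 1$), `lam_k` is the column $(\lambda_{1k}, \cdots, \lambda_{rk})$ for the reinforcing event that occurred, and the function names are hypothetical.

```python
def linear_update(p, theta_k, lam_k):
    """One step of (3.1): p_j -> (1 - theta_k) * p_j + theta_k * lambda_jk."""
    if len(p) != len(lam_k):
        raise ValueError("p and lam_k must have one entry per response")
    return [(1 - theta_k) * pj + theta_k * lj for pj, lj in zip(p, lam_k)]


def estes_update(p, event, theta):
    """Estes Model, (3.2)-(3.3): event 0 changes nothing, event k reinforces response k."""
    if event == 0:
        return list(p)                      # theta_0 = 0, so p is unchanged
    lam = [1.0 if j == event else 0.0 for j in range(1, len(p) + 1)]
    return linear_update(p, theta, lam)
```

Because each column of $\lambda$ sums to one, the updated list again sums to one; for example, `estes_update([0.5, 0.3, 0.2], 2, 0.1)` moves probability toward the second response and leaves the total unchanged.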

We also define here certain moments which are of experimental
interest and whose asymptotic properties we investigate subsequently.
The moments $\alpha_{j,n}^\nu$ of the response probabilities at trial $n$ are:

(3.4)  $\alpha_{j,n}^\nu = \sum_{x_{n-1}} P^\nu(A_n = j \mid x_{n-1})\, P(x_{n-1})$.

And if the appropriate limits exist, we define

(3.5)  $\alpha_j^\nu = \lim_{n \to \infty} \alpha_{j,n}^\nu$.

The moments (3.4) are formed in an unsymmetrical way; however,
they enter in a natural way in the expression of quantities which are
easily observed experimentally, for instance the joint probability
$P(A_{n+1} = j, A_n = j)$. (For other examples, see [6].)

We are also interested in studying extensions of the linear model
to multiperson situations. We may suppose that we have $s$ subjects in
a situation such that the probability of a particular reinforcing event for
any one subject will depend in general on preceding responses and
reinforcements of the other $s - 1$ subjects as well as on his own prior
responses and reinforcements. The data on each trial may then be
represented by an ordered $2s$-tuple $(j_1, k_1, \cdots, j_s, k_s)$ of integers with
$1 \le j_i \le r_i$, $0 \le k_i \le t_i$, for $i = 1, \cdots, s$, and any sequence of such tuples
represents a possible experimental outcome. Let $A_n^{(i)}$ and $E_n^{(i)}$ be the
response and reinforcement random variables for the $i$th subject on trial
$n$. We may then generalize Axiom L to:

Axiom M. For $1 \le i \le s$, if $E_n^{(i)} = k$ and $P(x_n) > 0$ then

(3.6)  $P(A_{n+1}^{(i)} = j \mid x_n) = (1 - \theta_k^{(i)})\, P(A_n^{(i)} = j \mid x_{n-1}) + \theta_k^{(i)} \lambda_{jk}^{(i)}$,

where $0 \le \theta_k^{(i)}, \lambda_{jk}^{(i)} \le 1$ and $\sum_j \lambda_{jk}^{(i)} = 1$.

Experimental tests of Axiom M for two-person situations are reported
in Estes [5] and in Atkinson and Suppes [1]. Let $x_{n-1}^{(i)}$ be just the
sequence of first $n - 1$ responses and reinforcements of subject $i$. It is
a consequence$^1$ of Axiom M that

and it is in terms of $x_{n-1}^{(i)}$ that we define moments $\alpha_{j,n}^{(i)\nu}$ exactly
analogous to (3.4). We shall also be interested in the joint moments

and their asymptotes $\alpha_{j_1, \cdots, j_s}^{\nu_1, \cdots, \nu_s}$ if they exist. To work with these
latter moments in terms of Axiom M we need the additional reasonable
assumption that when all the $n - 1$ preceding responses and reinforcements
are given, the $s$ responses on trial $n$ are statistically independent:

Axiom I. If $P(x_{n-1}) > 0$ then

$P(A_n^{(1)} = j_1, \cdots, A_n^{(s)} = j_s \mid x_{n-1}) = \prod_{i=1}^{s} P(A_n^{(i)} = j_i \mid x_{n-1})$.

The experimental restriction implied by Axiom I has been satisfied in
the multiperson studies employing the linear model.

4. Asymptotic theorems for learning models. After dealing with
some matters of notation, we state general theorems on the existence
of asymptotic moments. The hypotheses of the theorems give some
broad conditions which guarantee ergodic behavior. We begin with the
one-person models satisfying Axiom L.

In this section it will be convenient to use some of the notation of
§2. Thus we may write $P(A_n = j \mid x_m + x')$ in place of $P(A_n = j \mid x_{n-1})$
to indicate we are interested in the last $m$ terms of $x_{n-1}$. The "sum"
$x_m + x'$ is just the combined sequence $x_{n-1}$. We reserve the subscript
$m$ for counting back $m$ trials from a given trial $n$.

To clarify the general theorem it is desirable to define in an exact
way the notion of the conditional probability of a reinforcing event
depending on only a finite number m of past trial outcomes and
independent of the trial number.

Definition. A linear model has a reinforcement schedule with past
dependence of length m if, and only if, for all k, n and n' with n, n' > m
and all $x_m$, $x'$ and $x''$

$$P(E_n = k \mid x_m + x') = P(E_{n'} = k \mid x_m + x''). \tag{4.1}$$

(It is understood that $x_m$ includes the response $A_{j,n}$ which precedes $E_{k,n}$
on trial n.) It is to be noticed that the use of n on one side and n' on
the other side of (4.1) yields independence of trial number. The term
reinforcement schedule has been used because of its frequent occurrence
with approximately this meaning in the experimental literature. For the
conditional probabilities of (4.1) we shall use the notation

$$\pi_{k, x_m} = P(E_n = k \mid x_m + x). \tag{4.2}$$

1 Proof of this fact is analogous to that of Theorem 4.8 of [6].

We  may  now  state  the  first  general  theorem. 

Theorem 4.1. Let $\mathcal{L}$ be a linear model such that
(i) $\mathcal{L}$ has a reinforcement schedule with past dependence of
length m*,

(ii) there is an integer k* such that

(a) $\theta_{k^*} \neq 0$,

(b) there is a δ* and an $m_0$ such that for all sequences x and all
integers n

$$P(E_{n+m_0} = k^* \mid x_n) \geq \delta^* > 0.$$

Then the asymptotic moments $\alpha^{\nu}_{j}$ of $\mathcal{L}$ all exist and are independent
of the initial distribution of responses.

Proof. The central task is to characterize $\mathcal{L}$ as a chain of infinite
order and show that satisfaction of the hypotheses of the theorem
implies satisfaction of conditions (2.3) and (2.5). With this accomplished
the asymptotic theorems of §2 may be applied to $\mathcal{L}$. It is most
convenient to take as states of the chain the ordered pairs (j, k), where j
is the response on trial n, say, and k is the reinforcement on the
preceding trial. Consider now the reinforcement k* of the hypothesis of
the theorem. Let j* be a response such that $\lambda_{j^*k^*} \neq 0$. (There is at
least one such j* since $\sum_j \lambda_{jk^*} = 1$; in the Estes model j* = k*.) With
the pair (j*, k*) as the state $j_0$ of the infinite order chain, we shall
establish (2.3) and (2.5).

To verify (2.3), we use (ii)b of the hypothesis and the following
equalities and inequalities, which hold for all x and n:

$$P(A_{n+m_0+1} = j^*, E_{n+m_0} = k^* \mid x_n) = \sum_{x_{m_0-1}} P(A_{n+m_0+1} = j^*, E_{n+m_0} = k^*, x_{m_0-1} \mid x_n).$$

Applying Axiom L, the right-hand side becomes:

$$= \sum_{x_{m_0-1}} \bigl[(1 - \theta_{k^*}) P(A_{n+m_0} = j^* \mid x_{m_0-1} + x_n) + \theta_{k^*}\lambda_{j^*k^*}\bigr] \, P(E_{n+m_0} = k^*, x_{m_0-1} \mid x_n)$$

$$\geq \theta_{k^*} \lambda_{j^*k^*} \delta^* \quad \text{by (ii)b}.$$

To establish (2.5), consider the following equalities and inequalities:

$$|P(A_{n+1} = j, E_n = k \mid x + x') - P(A_{n'+1} = j, E_{n'} = k \mid x + x'')|$$
$$= \pi_{k, x_{m^*}} |P(A_{n+1} = j \mid E_n = k, x + x') - P(A_{n'+1} = j \mid E_{n'} = k, x + x'')|, \tag{4.3}$$

where $x_{m^*}$ means the last m* terms of x, and where the sequence x
contains at least m occurrences of k*, with m > m*. The equality
follows from (i) of the hypothesis, for by virtue of (i)

$$\pi_{k, x_{m^*}} = P(E_n = k \mid x + x') = P(E_{n'} = k \mid x + x'').$$

Applying Axiom L once to the right-hand side of (4.3) we get, ignoring
the factor $\pi_{k, x_{m^*}}$,

$$|P(A_{n+1} = j \mid E_n = k, x + x') - P(A_{n'+1} = j \mid E_{n'} = k, x + x'')|$$
$$= (1 - \theta_k) |P(A_n = j \mid x + x') - P(A_{n'} = j \mid x + x'')|.$$

We do not know that $\theta_k \neq 0$,² but as we apply Axiom L repeatedly, we
obtain the factor $(1 - \theta_{k^*})$ at least m times, so that

$$|P(A_{n+1} = j, E_n = k \mid x + x') - P(A_{n'+1} = j, E_{n'} = k \mid x + x'')|$$
$$\leq (1 - \theta_{k^*})^m |P(A_{n-h} = j \mid x') - P(A_{n'-h} = j \mid x'')|, \tag{4.4}$$

where h is the length of x. The difference term on the right of this
inequality is not more than 1, so that from (4.4) we obtain the estimate
for m > m*

$$\varepsilon_m \leq (1 - \theta_{k^*})^m,$$

whence

$$\sum_{m=0}^{\infty} \varepsilon_m < \infty,$$

which is (2.5).

On the basis of (2.3) and (2.5) we know from Theorem 2.4 that the
asymptotic cross-moments of $\mathcal{L}$ exist and are independent of the initial
distribution of responses. But

$$P(A_n = j \mid x_{n-1}) = \sum_k P(A_n = j, E_{n-1} = k \mid x_{n-1}),$$

and so the moments $\alpha^{\nu}_{j,n}$ can be expressed as sums of the cross-moments
for the infinite order chain $\mathcal{L}$, which insures the existence of the
limiting moments (3.5) and that they do not depend upon initial conditions.

There are several remarks to be made about the theorem just
proved. First, we observe that a simple sufficient (but not necessary)
condition for (ii)b is

$$\min_{x_{m^*}} \pi_{k^*, x_{m^*}} \neq 0. \tag{4.5}$$

2 If all $\theta_k \neq 0$, the original condition given in [3] would be satisfied; our weaker
condition (2.5) allows inclusion of cases where some of the $\theta_k$ are 0 (i.e. where there can be
trials without a reinforcement).

The interpretation of (4.5) is that the reinforcing event k* has positive
probability on every trial no matter what sequence $x_{m^*}$ of responses and
reinforcements preceded. A number of interesting experimental cases
of the linear model can be described in terms of (4.5), (i) and (ii)a of
Theorem 4.1.

I. Contingent case with lag v. In the Estes model let
$P(E_n = k \mid A_{n-v} = j, x) = \pi_{jk}(v)$, for all x such that $P(A_{n-v} = j, x) > 0$. To satisfy
(4.5), we need only that for some k, $\pi_{jk}(v) \neq 0$ for all j. Experimental
data for v = 0, 1, 2 are given in Estes [5].

II. Double contingent case. Let

$$P(E_n = k \mid A_n = j, A_{n-1} = j', x) = \pi_{k,jj'},$$

for all x such that $P(A_n = j, A_{n-1} = j', x) > 0$.

Then (i) of Theorem 4.1 is immediately satisfied, and for (ii)a and
(4.5) we need a k such that $\theta_k \neq 0$ and for all j and j', $\pi_{k,jj'} \neq 0$.

An  interesting  fact  about  (I)  and  (II)  is  that  although  they  are 
simple  to  test  experimentally  and  their  asymptotic  response  moments 
exist  on  the  basis  of  Theorem  4.1,  there  is  no  known  constructive  method 
for  computing  the  actual  asymptotes.  (The  Estes  [5]  test  of  (I)  excludes 
non-reinforced  trials  which  cause  the  computational  difficulties.)  It  may 
also  be  noted  that  the  convergence  theorems  in  Karlin  [8]  do  not  in 
general  apply  to  (II),  and  apply  to  (I)  only  if  v  =  0. 

On  the  basis  of  the  proof  of  Theorem  4.1  we  may,  by  virtue  of 
Theorem  2.2,  conclude  that  the  asymptotic  joint  probabilities  of  successive 
responses  also  exist: 

Corollary 1. If the hypothesis of Theorem 4.1 is satisfied, then
for every m the limit as n → ∞ of

$$P(A_{n+m} = j_m, A_{n+m-1} = j_{m-1}, \ldots, A_n = j_0)$$

exists.

We may regard the quantities $P(A_n = j \mid x_{n-1})$, for $1 \leq j \leq r$, as a
random probability vector with an arbitrary joint distribution $F_1$ on trial
1, and distribution $F_n$ on trial n. The following corollary is a consequence
of the existence of the moments $\alpha^{\nu}_{j}$ independent of the initial response
probabilities.

Corollary 2. If the hypothesis of Theorem 4.1 is satisfied, then
there is a unique asymptotic distribution $F_\infty$, independent of $F_1$, to which
the distributions $F_n$ converge.

For the multiperson situation characterized by Axioms I and M, we
have a theorem analogous to Theorem 4.1. For use in the hypothesis
of this theorem we define the notion of reinforcement schedule with
past dependence of length m, exactly as we did in (4.1), namely, we
have such a schedule if for all $k^{(i)}$, $1 \leq i \leq s$, all n and n' with n, n' > m
and all $x_m$, $x'$ and $x''$

$$\pi_{k^{(1)}, \ldots, k^{(s)}, x_m} = P(E^{(1)}_n = k^{(1)}, \ldots, E^{(s)}_n = k^{(s)} \mid x_m + x')$$
$$= P(E^{(1)}_{n'} = k^{(1)}, \ldots, E^{(s)}_{n'} = k^{(s)} \mid x_m + x'').$$

Theorem 4.2. Let $\mathcal{M}$ be an s-person linear model such that
(i) $\mathcal{M}$ has a reinforcement schedule with past dependence of
length m*,

(ii) there are integers $k^{(i)*}$, for $1 \leq i \leq s$, such that

(a) $\theta^{(i)}_{k^{(i)*}} \neq 0$,

(b) there is a δ* and an $m_0$ such that for all sequences x and
all integers n

$$P(E^{(1)}_{n+m_0} = k^{(1)*}, \ldots, E^{(s)}_{n+m_0} = k^{(s)*} \mid x_n) \geq \delta^* > 0.$$

Then the asymptotic moments $\gamma_{j^{(1)}, \ldots, j^{(s)}}$ of $\mathcal{M}$ all exist and are
independent of the initial distribution of responses.

Proof. The states of the chain are now defined as 2s-tuples
$(j^{(1)}, \ldots, j^{(s)}, k^{(1)}, \ldots, k^{(s)})$, where $j^{(i)}$ is the response made by the ith
subject and $k^{(i)}$ is the reinforcement for that subject on the preceding
trial. Using the reinforcements $k^{(i)*}$ of the hypothesis, let $j^{(i)*}$ be such
that $\lambda^{(i)}_{j^{(i)*}k^{(i)*}} \neq 0$. We take $(j^{(1)*}, \ldots, j^{(s)*}, k^{(1)*}, \ldots, k^{(s)*})$ as the state $j_0$
for which we establish (2.3) and (2.5). To simplify notation, it is
convenient to define:

$$p_{n+1}(j, k \mid x) = P(A^{(1)}_{n+1} = j^{(1)}, \ldots, A^{(s)}_{n+1} = j^{(s)}, E^{(1)}_n = k^{(1)}, \ldots, E^{(s)}_n = k^{(s)} \mid x),$$
$$p_{n+1}(j^{(i)} \mid k, x) = P(A^{(i)}_{n+1} = j^{(i)} \mid E^{(1)}_n = k^{(1)}, \ldots, E^{(s)}_n = k^{(s)}, x).$$

Moreover, we omit the superscript notation from θ and λ.

To verify (2.3) we proceed exactly as in the proof of Theorem 4.1,
applying now Axioms I and M instead of L, and we obtain that

$$p_{n+m_0+1}(j^*, k^* \mid x_n) \geq \prod_{i=1}^{s} \theta_{k^{(i)*}} \lambda_{j^{(i)*}k^{(i)*}} \, \delta^*.$$

For (2.5), we first observe that by virtue of (i) of the hypothesis
and Axiom I

$$|p_{n+1}(j, k \mid x + x') - p_{n'+1}(j, k \mid x + x'')|$$
$$= \pi_{k, x_{m^*}} \Bigl| \prod_{i=1}^{s} p_{n+1}(j^{(i)} \mid k, x + x') - \prod_{i=1}^{s} p_{n'+1}(j^{(i)} \mid k, x + x'') \Bigr|.$$

We notice next that the right-hand side is

$$\leq \pi_{k, x_{m^*}} \Bigl[ p_{n+1}(j^{(1)} \mid k, x + x') \Bigl| \prod_{i=2}^{s} p_{n+1}(j^{(i)} \mid k, x + x') - \prod_{i=2}^{s} p_{n'+1}(j^{(i)} \mid k, x + x'') \Bigr|$$
$$+ \prod_{i=2}^{s} p_{n'+1}(j^{(i)} \mid k, x + x'') \, \bigl| p_{n+1}(j^{(1)} \mid k, x + x') - p_{n'+1}(j^{(1)} \mid k, x + x'') \bigr| \Bigr].$$

Continuing this same development, we obtain:

$$\leq \sum_{i=1}^{s} |p_{n+1}(j^{(i)} \mid k, x + x') - p_{n'+1}(j^{(i)} \mid k, x + x'')|.$$

And by the line of reasoning used in the proof of Theorem 4.1, if the
sequence x contains state $(j^{(1)*}, \ldots, k^{(s)*})$ at least m times the last
quantity is

$$\leq \sum_{i=1}^{s} (1 - \theta_{k^{(i)*}})^m.$$

Provided m > m* this inequality yields an estimate of $\varepsilon_m$ from which
we conclude that (2.5) holds. The existence of the asymptotic moments
then follows from the theory of §2 as in the case of Theorem 4.1. Q.E.D.

A  pair  of  corollaries  follow  from  the  theorem  just  proved  which  are 
exactly  like  the  two  given  after  Theorem  4.1. 

Finally,  we  want  to  remark  that  Axiom  L  involves  linear  functions 
which  are  distance  diminishing,  i.e.,  have  slope  less  than  one.  The 
asymptotic  results  of  this  section  apply  to  many  learning  models  in 
which  these  linear  functions  are  replaced  by  non-linear  functions  having 
this property.

FINITE MARKOV PROCESSES IN PSYCHOLOGY*
George  A.  Miller 

MASSACHUSETTS    INSTITUTE    OF   TECHNOLOGY 

Finite  Markov  processes  are  reviewed  and  considered  for  their  usefulness 
in  the  description  of  behavioral  data.  The  various  alternative  responses  in 
an  experimental  situation  define  a  vector  space,  and  changes  in  the  probabili- 
ties of  these  alternatives  are  represented  by  movements  in  this  space.  Meth- 
ods of  fitting  the  theory  to  experimental  data  are  considered. 

The  simplest  process,  with  a  constant  matrix  of  transitional  probabilities 
that  is  applied  repeatedly  to  represent  the  effect  of  successive  trials,  seems 
inadequate  for  most  learning  data.  A  matrix  function  that  may  be  useful  for 
learning  theory  is  presented. 

In  the  two  general  areas  where  psychology  has  been  relatively  successful 
as  a  quantitative  science,  i.e.,  sensory  psychology  and  test  construction, 
probabilistic  considerations  long  ago  proved  their  worth.  It  is  characteristic 
of  these  two  areas,  however,  that  the  observations  are  relatively  invariant 
in  time.  The  basic  parameters  can  be  explored  at  length  because  sequential 
effects  of  measurement  are  secondary  and  can  be  ignored  or  randomized. 
This  fortunate  situation  makes  it  possible  to  use  familiar  probability  models 
based  upon  independent  random  variables. 

With  the  more  dynamic  problems  of  psychology,  however,  this  familiar 
model  has  not  often  led  to  profitable  results.  For  example,  it  is  intrinsic  in 
the  very  notion  of  learning  that  successive  measurements  are  not  inde- 
pendent; attempts  to  use  a  theory  of  independent  variables  must  either  fail 
or  misrepresent  the  basic  process.  Such  failures  may  lead  to  a  rejection  of 
statistical  concepts  as  inadequate;  a  more  proper  attitude  is  to  abandon  the 
assumption  of  independence  and  ask  what  help  can  be  had  from  dependent 
probabilities.  The  simplest  mathematical  models  incorporating  dependent 
probabilities  are  the  finite  Markov  processes.  In  this  paper  such  processes 
are  examined  for  their  usefulness  and  their  limitations  for  describing  psycho- 
logical data. 

1.  Simple  Markov  Chains  with  Two  Alternatives.  The  data  from  psycho- 
logical experiments  usually  come  in  the  form  of  sequences  of  choices  em- 
bedded in  the  time  continuum.  Often  it  is  possible  to  ignore  the  temporal 
order  in  which  alternative  choices  occur.  The  purpose  of  this  discussion, 

*This  article  was  written  at  the  Institute  for  Advanced  Study  in  Princeton,  New 
Jersey,  while  the  author  was  on  sabbatical  leave  from  Harvard  University. 

This article appeared in Psychometrika, 1952, 17, 149-167. Reprinted with permission.

however, is to examine situations in which the temporal sequence should
not  be  ignored.  We  shall  adopt  the  Markovian  model  of  dependent  prob- 
abilities to  discuss  such  sequences.  We  begin,  therefore,  with  the  simplest 
possible  example  of  a  Markov  chain. 

Consider  an  experiment  in  which  only  two  alternative  responses  are 
possible.  A  trial  consists  of  a  choice  of  one  of  these  two  alternatives.  If  the 
letters  A  and  B  designate  these  choices,  then  a  sequence  of  trials  might 
produce  the  sequence  of  responses  ABBAAABA  .  . .  ,  where  the  durations 
and  latencies  are  ignored.  We  shall  assume  that  this  sequence  is  produced 
by  a  Markov  process;  i.e.,  that  the  distribution  of  probabilities  at  trial  n  +  1 
depends  upon  the  outcome  of  trial  n.  However,  the  knowledge  of  outcomes 
prior  to  n  does  not  change  our  description  of  the  system  if  we  know  the 
outcome  of  trial  n.  In  other  words,  the  present  state  of  the  system  governs 
its  future  development. 

We  adopt  the  following  notation: 

n              number of the trial: 0, 1, 2, ... .
A and B        the two alternative responses.
$p^{(n)}(A)$     probability of alternative A at trial n.
$p(A)$         asymptotic value of $p^{(n)}(A)$ as n → ∞.
$d_n$          the set of absolute probabilities at trial n, considered as
               a vector; $[p^{(n)}(A), p^{(n)}(B)]$.
$p_A(B)$       given A at n, the conditional probability of B at n + 1.
$p_A^{(m)}(B)$   given A at n, the conditional probability of B at n + m,
               m = 2, 3, ... .
T              matrix of transitional probabilities.
$\lambda_i$    characteristic roots of the matrix T.

Alternative  A  can  occur  at  trial  n  +  1  in  either  of  two  ways.  Either  it 
follows  an  A  on  trial  n,  or  it  follows  a  B  on  trial  n.  Similarly,  B  can  occur 
at  n  +  1  in  either  of  two  ways.  This  obvious  fact  leads  to  the  following 
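equations:

$$\begin{aligned}
p^{(n)}(A)\,p_A(A) + p^{(n)}(B)\,p_B(A) &= p^{(n+1)}(A) \\
p^{(n)}(A)\,p_A(B) + p^{(n)}(B)\,p_B(B) &= p^{(n+1)}(B).
\end{aligned} \tag{1}$$

In matrix notation these equations can be written

$$\begin{pmatrix} p_A(A) & p_B(A) \\ p_A(B) & p_B(B) \end{pmatrix}
\begin{pmatrix} p^{(n)}(A) \\ p^{(n)}(B) \end{pmatrix}
= \begin{pmatrix} p^{(n+1)}(A) \\ p^{(n+1)}(B) \end{pmatrix}. \tag{2}$$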

The reader is assumed to be familiar with the elements of matrix theory.
If the distribution of probabilities on trials n and n + 1 is regarded as the
vectors $d_n$ and $d_{n+1}$ in a two-dimensional space, then the square matrix of
transitional probabilities is a linear transformation or operator mapping $d_n$
into $d_{n+1}$. Thus we can write Eq. (2) as

$$T d_n = d_{n+1}. \tag{3}$$

Any  sequence  of  distributions  can  be  produced  by  operating  upon  the 
successive  di  by  appropriate  transformations.  For  the  moment,  however,  we 
shall  consider  a  special  case.  We  shall  assume  that  repeated  trials  can  be 
represented  as  repeated  transformations  by  the  same  operator.  Thus  we 
can  write  for  the  initial  trial: 

$$T d_0 = d_1.$$

A second trial carries $d_1$ into $d_2$:

$$T d_1 = d_2.$$

In terms of $d_0$, therefore, we can write:

$$T d_1 = T(T d_0) = T^2 d_0 = d_2.$$

Or more generally,

$$T^n d_0 = d_n. \tag{4}$$

Since the probabilities of A and B on successive trials are given by
$T^n d_0$, we proceed to examine the powers of T. The elements of $T^n$ are $p_i^{(n)}(j)$,
where i = A, B; j = A, B. We wish to find a general expression for $T^n$ in
terms of $p_i(j)$ and n. From matrix theory we know that every square matrix
with distinct roots is similar* to a diagonal matrix whose diagonal elements
are the characteristic roots $\lambda_i$ of T. We designate this similar diagonal matrix
by Λ, and write

$$\Lambda = S^{-1} T S,$$

where S is a matrix whose columns are the characteristic vectors of T. From
this we obtain

$$T = S \Lambda S^{-1}.$$

To obtain the powers of T we note that

$$T^2 = S \Lambda S^{-1} S \Lambda S^{-1} = S \Lambda^2 S^{-1},$$

or more generally,

$$T^n = S \Lambda^n S^{-1}. \tag{5}$$

Powers of Λ are simply calculated, for since Λ is a diagonal matrix, its powers
are given by the powers of the diagonal elements $\lambda_i$.

*Two matrices X and Y are said to be similar when $Y = S^{-1} X S$ for some
nonsingular matrix S; similar matrices have the same characteristic roots.

To find Λ for the matrix of Eq. (2) we first write the characteristic
equation for the matrix T. If we use the fact that $p_A(A) + p_A(B) = 1$ (and
similarly for B subscripts), the determinantal equation can be written in the
convenient form

$$\det(T - \lambda I) = \lambda^2 - [p_A(A) + p_B(B)]\lambda + [p_A(A) - p_B(A)] = 0.$$

The roots of this equation are the characteristic roots of the matrix:

$$\lambda_1 = 1 \quad \text{and} \quad \lambda_2 = p_A(A) - p_B(A).$$

Since the sums of all the columns of T are unity, we note that unity is always
a root of these matrices. Substituting these roots into $T v_i = \lambda_i v_i$ and solving
for the characteristic vectors, $v_i$, we obtain the vectors $[1, p_A(B)/p_B(A)]$
and (1, -1). These vectors comprise the columns of S, and so from Eq.
(5) we obtain, after inverting S,

$$T^n = \begin{pmatrix} 1 & 1 \\ \dfrac{p_A(B)}{p_B(A)} & -1 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & [p_A(A) - p_B(A)]^n \end{pmatrix}
\frac{p_B(A)}{p_A(B) + p_B(A)}
\begin{pmatrix} 1 & 1 \\ \dfrac{p_A(B)}{p_B(A)} & -1 \end{pmatrix}. \tag{6}$$

Eq. (6) can be written more conveniently

$$T^n = \frac{1}{p_A(B) + p_B(A)} \begin{pmatrix} p_B(A) & p_B(A) \\ p_A(B) & p_A(B) \end{pmatrix}
+ \frac{[p_A(A) - p_B(A)]^n}{p_A(B) + p_B(A)} \begin{pmatrix} p_A(B) & -p_B(A) \\ -p_A(B) & p_B(A) \end{pmatrix}. \tag{7}$$

Since $|p_A(A) - p_B(A)| < 1$, the second term on the right of Eq. (7) goes
to zero as n → ∞, so the first term represents the asymptotic form of $T^n$.

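With Eq. (7) we can calculate $T^n d_0$, and so obtain the probability of
A on successive trials:

$$p^{(n)}(A) = \frac{p_B(A)}{p_A(B) + p_B(A)} + [p_A(A) - p_B(A)]^n \, \frac{p^{(0)}(A)\,p_A(B) - p^{(0)}(B)\,p_B(A)}{p_A(B) + p_B(A)}. \tag{8}$$

The value of $p^{(n)}(B)$ is $1 - p^{(n)}(A)$.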
It is apparent that Eq. (8) can be written

$$p^{(n)}(A) = a(1 - b e^{-cn}), \tag{9}$$

where

$$a = \frac{p_B(A)}{p_A(B) + p_B(A)},$$
$$b = -p^{(0)}(A)\,\frac{p_A(B)}{p_B(A)} + p^{(0)}(B),$$
$$c = -\ln\,[p_A(A) - p_B(A)].$$

Eq.  (9)  is  an  exponential  growth  function — a  form  frequently  used  to  de- 
scribe data  from  learning  experiments.  It  should  be  noted,  however,  that 
while  the  average  subject  may  follow  such  a  learning  function,  the  individual 
subjects  are  generating  stationary  time  series  that  do  not  represent  learning. 
The  term  "learning"  probably  should  be  reserved  for  those  cases  in  which 
the  matrix  operator  changes  on  successive  trials. 

We shall illustrate the use of the Markov chain with a numerical
example. Suppose that two alternative responses are called right (R) and
wrong (W), that $p^{(n)}(R)$ and $p^{(n)}(W)$ are measured by the percentage of
subjects in a large sample that choose R and W on trial n, and that the
transitional probabilities observed on successive pairs of trials are constant.
Assume the following numerical values for $T d_0 = d_1$:

$$\begin{pmatrix} .97 & .27 \\ .03 & .73 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} .27 \\ .73 \end{pmatrix}.$$

A right response is followed by another right response 97 per cent of the
time; wrong follows wrong 73 per cent of the time. From Eq. (8) we calculate
that the successive values of $p^{(n)}(R)$ are 0, .27, .46, .59, .68, etc., approaching
the asymptote of .90. The equation is

$$p^{(n)}(R) = .9(1 - .7^n) \qquad (n = 0, 1, 2, \ldots).$$

If  we  know  that  on  a  particular  trial  a  W  occurred,  this  equation  gives  the 
probability  of  R  on  the  nth  succeeding  trial. 
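The arithmetic of this example can be checked numerically. The following is a minimal sketch in Python, not part of the original article; the helper names `step` and `closed_form` are illustrative, and the starting vector (0, 1), i.e. every subject wrong on trial 0, is assumed from the stated initial value of 0 for the probability of R.

```python
# Two-state chain for the right/wrong example.
# Columns of T are "from" states (R, W); rows are "to" states (R, W).
T = [[0.97, 0.27],
     [0.03, 0.73]]

def step(T, d):
    """Apply the transition matrix once and return T d."""
    return [sum(T[i][j] * d[j] for j in range(len(d))) for i in range(len(T))]

def closed_form(n, a=0.9, root=0.7):
    """Eq. (9) with b = 1: probability of R on trial n is a * (1 - root**n)."""
    return a * (1.0 - root ** n)

d = [0.0, 1.0]            # assumed d_0: everyone wrong at trial 0
iterated = []
for n in range(5):
    iterated.append(d[0])  # probability of R on trial n
    d = step(T, d)

print([round(p, 2) for p in iterated])
```

Under these assumptions the iterated values agree with the closed form .9(1 - .7^n) to rounding error, which is the agreement the text reports.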

2.  Autocorrelation  Function.  A  simple  parameter  of  such  Markov  chains 
is  the  autocorrelation  function.  We  will  mention  it  now  because  for  the  more 
complex  cases  we  wish  to  consider  next  the  autocorrelation  function  is  either 
not  defined  or  is  most  tedious  to  compute  from  the  matrix  of  transitional 
probabilities. 

The  autocorrelation  function  is  the  correlation  of  a  time  series  with  itself 
displaced  0,  1,  2,  .  .  .  steps.  With  zero  displacement  the  correlation  of  the 
series  with  itself  is,  of  course,  +1.  With  a  displacement  of  one  step,  the 
responses on trials 1, 2, 3, ... are correlated with the responses on trials
2, 3, 4, ... . If the series of binary choices is fairly long, the autocorrelation
after a displacement of one step is given by

$$r_1 = p_A(A) - p_B(A). \tag{10}$$

We note that $r_1$ is a characteristic root of the matrix of transitional
probabilities. More generally,

$$r_m = p_A^{(m)}(A) - p_B^{(m)}(A), \tag{11}$$

where $p_A^{(m)}(A)$ and $p_B^{(m)}(A)$ are elements of $T^m$. From Eq. (7) we observe
that these elements of $T^m$ are

$$p_A^{(m)}(A) = \frac{p_B(A) + p_A(B)[p_A(A) - p_B(A)]^m}{p_A(B) + p_B(A)}$$

and

$$p_B^{(m)}(A) = \frac{p_B(A) - p_B(A)[p_A(A) - p_B(A)]^m}{p_A(B) + p_B(A)}.$$

When these values are substituted in Eq. (11), we obtain

$$r_m = [p_A(A) - p_B(A)]^m = r_1^m. \tag{12}$$

In short, for a simple Markov chain, the autocorrelation between positions
n and n + m is the mth power of the autocorrelation between n and n + 1.
If $|r_1| < 1$, then $|r_m|$ declines monotonically toward zero.

A  simple  example  is  provided  by  the  Samoan  language.  E.  B.  Newman 
has noted that the sequence of consonants (C) and vowels (V) in Samoan
writing is adequately described as a Markov chain with the following matrix
of transitional probabilities:

$$\begin{pmatrix} p_C(C) & p_V(C) \\ p_C(V) & p_V(V) \end{pmatrix} = \begin{pmatrix} 0 & .49 \\ 1 & .51 \end{pmatrix}.$$

Consonants  never  follow  consonants  in  written  Samoan.  The  autocorrelation 
function  is  easily  computed  from  this  matrix.  For  successive  displacements 
of  one  letter  the  value  of  the  correlation  coefficient  is  1,  —.49,  .24,  —.12, 
.06,  -.03,  etc. 

The autocorrelation function for this simple process can also be
described as the determinant of $T^n$. Thus $r_0$ is the determinant of $T^0 = I$, $r_1$
is the determinant of T, $r_2$ is the determinant of $T^2$, etc.
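The two descriptions of the autocorrelation function, as a power of the second characteristic root and as the determinant of a power of T, can be compared directly on the Samoan matrix. The sketch below is an illustration only; `matmul` and `det2` are hypothetical helper names written for 2 x 2 matrices.

```python
# Samoan consonant/vowel chain; columns are "from" states (C, V).
T = [[0.0, 0.49],
     [1.0, 0.51]]

def matmul(X, Y):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def det2(X):
    """Determinant of a 2 x 2 matrix."""
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]

r1 = T[0][0] - T[0][1]             # Eq. (10): p_C(C) - p_V(C)
power = [[1.0, 0.0], [0.0, 1.0]]   # T^0 = I
by_root, by_det = [], []
for m in range(6):
    by_root.append(r1 ** m)        # Eq. (12): r_m = r_1 ** m
    by_det.append(det2(power))     # determinant of T^m
    power = matmul(T, power)

print([round(r, 2) for r in by_root])
```

Both lists reproduce the sequence quoted for Samoan (1, -.49, .24, -.12, .06, -.03) after rounding to two places.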

When  the  distribution  of  probabilities  at  n  +  1  depends  upon  events 
prior  to  n  as  well  as  upon  n  itself,  Eq.  (10)  still  holds  as  a  definition  of  the 
autocorrelation  function,  but  Eq.  (11)  does  not  hold.  When  more  than  two 
unscaled alternatives are used, the autocorrelation function is not defined.

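3. Extension to More than Two Alternatives. The extension of the matrix
equations to experiments involving more than two alternative responses is
straightforward. Designate the alternatives A, B, C, ..., N. Then we have

$$\begin{pmatrix}
p_A(A) & p_B(A) & \cdots & p_N(A) \\
p_A(B) & p_B(B) & \cdots & p_N(B) \\
\vdots & \vdots & & \vdots \\
p_A(N) & p_B(N) & \cdots & p_N(N)
\end{pmatrix}
\begin{pmatrix} p^{(n)}(A) \\ p^{(n)}(B) \\ \vdots \\ p^{(n)}(N) \end{pmatrix}
= \begin{pmatrix} p^{(n+1)}(A) \\ p^{(n+1)}(B) \\ \vdots \\ p^{(n+1)}(N) \end{pmatrix}. \tag{13}$$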


General  solutions  are  known  for  certain  types  of  operators.  These  are  of 
considerable  interest  in  physics  and  genetics,  where  the  elements  of  T  are 
given  by  theory.  The  present  use  of  such  operators  is  almost  purely  de- 
scriptive, however,  for  we  do  not  know  what  special  types  of  matrices  will 
be  of  the  greatest  psychological  interest. 

It  is  not  always  necessary  to  find  a  general  solution.  A  qualitative  un- 
derstanding of  an  experimental  situation  is  often  provided  by  simply  trans- 
forming the  initial  distribution  five  or  ten  steps  by  direct  matrix  multi- 
plication. For  example,  a  learning  situation  might  be  analyzed  into  three 
kinds  of  responses:  correct  (C),  slightly  wrong  (S),  and  grossly  wrong  (G). 
During  the  course  of  learning  a  subject  begins  by  making  gross  mistakes, 
then  slight  mistakes,  and  finally  manages  to  make  correct  responses.  Such  a 
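situation could produce a matrix equation like the following:

$$\begin{pmatrix} p_C(C) & p_S(C) & p_G(C) \\ p_C(S) & p_S(S) & p_G(S) \\ p_C(G) & p_S(G) & p_G(G) \end{pmatrix}
\begin{pmatrix} p^{(0)}(C) \\ p^{(0)}(S) \\ p^{(0)}(G) \end{pmatrix}
= \begin{pmatrix} .9 & .3 & 0 \\ .1 & .6 & .3 \\ 0 & .1 & .7 \end{pmatrix}
\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.$$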

It  is  tedious  to  find  the  general  solution  of  T",  and  it  is  easy  to  see  by  direct 
multiplication  what  happens.  The  proportion  of  grossly  wrong  responses 
declines  steadily:  1,  .7,  .52,  .40,  .32,  .26,  .  .  •  ,  .08.  The  proportion  of  small 
errors  on  successive  trials  at  first  increases,  then  decreases:  0,  .3,  .39,  .40, 
.38,  .35,  •  •  •  ,  .23.  The  proportion  of  correct  responses  gives  a  roughly  S- 
shaped  function:  0,  0,  .09,  .20,  .30,  .38,  .45,  .  .  .  ,  .69.  This  situation  is  analogous 
to  pouring  water  from  one  vessel  into  a  second,  which  in  turn  pours  the  water 
into a third. The asymptotic distribution can always be found by solving
the equation $T d_\infty = d_\infty$.
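The direct multiplication described above is easy to mechanize. The following sketch is an illustration under one stated assumption: the first row of the matrix (.9, .3, 0) is not legible in this copy and is inferred here from the requirement that every column sum to one. The name `step` is a hypothetical helper.

```python
# Three-response example; states ordered (C, S, G), columns are "from" states.
# The first row is inferred from the column sums, not read from the source.
T = [[0.9, 0.3, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.1, 0.7]]

def step(T, d):
    """Apply the transition matrix once and return T d."""
    return [sum(T[i][j] * d[j] for j in range(3)) for i in range(3)]

d = [0.0, 0.0, 1.0]        # every response grossly wrong at trial 0
history = [d]
for _ in range(200):       # many steps, so the last entry is near the asymptote
    d = step(T, d)
    history.append(d)

for n in range(6):
    print(n, [round(p, 2) for p in history[n]])
print("asymptote", [round(p, 2) for p in history[-1]])
```

With the inferred first row, the iterates match the proportions quoted in the text, and the limit agrees with the solution of the fixed-point equation, (9/13, 3/13, 1/13), or about (.69, .23, .08).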

The  form  of  a  general  solution  can  be  indicated,  for  finite  matrices  with 
distinct roots, as follows. Let $\lambda_i$ represent the N characteristic roots of the
polynomial $\det(T - \lambda I)$. We define a set of matrices $f_i(T)$ by

$$f_i(T) = \frac{(T - \lambda_1 I)(T - \lambda_2 I) \cdots (T - \lambda_{i-1} I)(T - \lambda_{i+1} I) \cdots (T - \lambda_N I)}{(\lambda_i - \lambda_1)(\lambda_i - \lambda_2) \cdots (\lambda_i - \lambda_{i-1})(\lambda_i - \lambda_{i+1}) \cdots (\lambda_i - \lambda_N)}. \tag{14}$$

In terms of these matrices, T can be expressed

$$T = \lambda_1 f_1(T) + \lambda_2 f_2(T) + \cdots + \lambda_N f_N(T). \tag{15}$$

If $g(\lambda)$ is a rational scalar polynomial, then

$$g(T) = g(\lambda_1) f_1(T) + g(\lambda_2) f_2(T) + \cdots + g(\lambda_N) f_N(T). \tag{16}$$

In particular, if $g(\lambda) = \lambda^n$, we have

$$T^n = \lambda_1^n f_1(T) + \lambda_2^n f_2(T) + \cdots + \lambda_N^n f_N(T). \tag{17}$$

The 2 × 2 transformation is expressed in this form in Eq. (7). Concerning
the roots $\lambda_i$, we know that $\lambda_1$ can be assigned the value 1, and that all the
other roots fall between -1 and +1. Thus the asymptotic value of $T^n$ is
given by $f_1(T)$.

The solution for a particular matrix can always be obtained by (a) finding
the roots of the characteristic polynomial, $\det(T - \lambda I)$; (b) determining the
$f_i(T)$ according to Eq. (14); (c) substituting into Eq. (17); and (d) solving
$T^n d_0$ for the given boundary conditions of $d_0$. This procedure has the
advantage of avoiding the problem of inverting a large matrix, but if two or
more roots are nearly the same, the computations may be quite difficult.
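Steps (a) through (c) can be sketched for the 2 × 2 right/wrong matrix used earlier, where the roots are known in closed form. This is an illustration only: the helper names `f`, `power_spectral` and `power_direct` are hypothetical, and the roots are supplied from the formula for the 2 × 2 case rather than computed by a general root finder.

```python
# Spectral form of T^n, Eqs. (14) and (17), for the right/wrong example.
T = [[0.97, 0.27],
     [0.03, 0.73]]
I = [[1.0, 0.0], [0.0, 1.0]]
roots = [1.0, T[0][0] - T[0][1]]   # step (a): roots 1 and p_R(R) - p_W(R)

def matmul(X, Y):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def f(i):
    """Step (b), Eq. (14): product over j != i of (T - root_j I)/(root_i - root_j)."""
    out = I
    for j, root in enumerate(roots):
        if j != i:
            factor = [[(T[r][c] - root * I[r][c]) / (roots[i] - root)
                       for c in range(2)] for r in range(2)]
            out = matmul(out, factor)
    return out

def power_spectral(n):
    """Step (c), Eq. (17): T^n as the sum of root_i**n times f_i(T)."""
    parts = [f(i) for i in range(2)]
    return [[sum(roots[i] ** n * parts[i][r][c] for i in range(2))
             for c in range(2)] for r in range(2)]

def power_direct(n):
    """T^n by repeated multiplication, for comparison."""
    out = I
    for _ in range(n):
        out = matmul(T, out)
    return out

print(f(0))   # the asymptotic form of T^n
```

Here `f(0)` has both columns equal to the asymptotic distribution (.9, .1), which is the first term of Eq. (7) for this matrix.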

The  autocorrelation  function  is  not  defined  for  more  than  two  unordered 
alternatives,  because  the  value  of  the  correlation  coefficient  varies  according 
to  the  various  possible  assignments  of  numerical  values  to  the  different 
alternatives.  However,  the  determinant  of  the  matrix  of  transitional  prob- 
abilities has  many  of  the  characteristics  of  a  correlation  coefficient,  and  in 
the  2X2  case  the  determinant  and  the  autocorrelation  coefficient  are 
identical.  The  determinant  of  T",  as  a  function  of  n,  lies  between  +1  and  —1, 
declines  toward  0  for  the  Markov  processes,  and  can  reveal  periodicities  in 
much  the  same  way  as  an  autocorrelation  function.  The  possible  usefulness 
of  this  extension  to  N  X  N  transformations  needs  to  be  explored. 

4.  Extension  to  Compound  Responses.  For  psychological  purposes  it  is 
an  inconvenience  that  Markov  processes  have  no  memory.  We  must  now 
remove  the  restriction  that,  if  the  outcome  of  the  trial  n  is  known,  events 
prior  to  n  are  irrelevant  for  predicting  the  outcome  at  n  +  1.  We  must  con- 
sider the  non-Markovian  case.  What  we  must  do  is  to  expand  the  definition 
of  a  state  of  the  system  in  order  to  make  such  systems  Markovian  in  a 
larger  space. 

If the probabilities at trial n + 1 depend upon the outcomes of trials n
and n - 1, but knowledge of events prior to n - 1 does not change our
prediction for n + 1, we have a non-Markovian system. This system is made
to be Markovian by changing the definition of an event. Instead of
characterizing the state of the system by the occurrence of a single response,
we characterize it by pairs of responses. If there are two atomic alternatives,
A and B, in the original system, then there are four compound alternatives,
AA, AB, BA, and BB, in the new system. Thus we must define a distribution
$d_n$ over four alternatives, and T is a square matrix of fourth order:


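$$T d_n = \begin{pmatrix}
p_{AA}(AA) & 0 & p_{BA}(AA) & 0 \\
p_{AA}(AB) & 0 & p_{BA}(AB) & 0 \\
0 & p_{AB}(BA) & 0 & p_{BB}(BA) \\
0 & p_{AB}(BB) & 0 & p_{BB}(BB)
\end{pmatrix}
\begin{pmatrix} p^{(n)}(AA) \\ p^{(n)}(AB) \\ p^{(n)}(BA) \\ p^{(n)}(BB) \end{pmatrix}
= \begin{pmatrix} p^{(n+1)}(AA) \\ p^{(n+1)}(AB) \\ p^{(n+1)}(BA) \\ p^{(n+1)}(BB) \end{pmatrix}
= d_{n+1}. \tag{18}$$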

Note  that  many  of  the  transitional  probabilities  are  zero;  it  is  not  possible 
for  the  system  to  move  from  some  state  to  others  in  a  single  step.  For  ex- 
ample, the  system  cannot  move  from  AA  to  BB  in  less  than  two  steps: 
AA → AB → BB, as in the sequence AABB.
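The construction of compound states can be sketched in code. The following is a minimal illustration, not the author's procedure: it assumes a response record stored as a string of `A` and `B` symbols, and the names `STATES`, `pair_transition_matrix`, and the sample sequence are hypothetical. It counts how often each pair is followed by each overlapping pair and normalizes by column, so the result has the layout of Eq. (18).

```python
from collections import Counter

STATES = ["AA", "AB", "BA", "BB"]

def pair_transition_matrix(seq):
    """Estimate a 4x4 matrix laid out as in Eq. (18) from a string of A/B symbols.

    Column j holds the relative frequencies of moving from STATES[j] to each
    STATES[i], so every column that has data sums to 1.
    """
    pairs = [seq[k:k + 2] for k in range(len(seq) - 1)]   # overlapping pairs
    counts = Counter(zip(pairs, pairs[1:]))               # (from, to) counts
    totals = Counter(pairs[:-1])                          # visits to each "from" state
    return [[counts[(src, dst)] / totals[src] if totals[src] else 0.0
             for src in STATES]
            for dst in STATES]

seq = "AABBABAABBBABAABAB"   # hypothetical response sequence, for illustration only
T = pair_transition_matrix(seq)
for dst, row in zip(STATES, T):
    print(dst, [round(x, 2) for x in row])
```

Because successive pairs overlap by one response, the structural zeros of Eq. (18) appear automatically: a pair ending in B can only be followed by a pair beginning with B.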

Tabulations  of  sequences  of  vowels  and  consonants  in  written  Hebrew 
have  been  made  by  E.  B.  Newman.  The  sequence  of  consonants  {A)  and 
vowels  {B)  can  be  adequately  represented  by  a  matrix  of  the  form  of  Eq.  (18): 


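$$
\begin{pmatrix}
0 & 0 & .23 & 0 \\
1 & 0 & .77 & 0 \\
0 & .81 & 0 & .90 \\
0 & .19 & 0 & .10
\end{pmatrix}
\begin{pmatrix}
.095 \\ .410 \\ .410 \\ .085
\end{pmatrix}
$$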

As  before,  the  transformation  T  can  be  applied  iteratively  to  carry  any 
initial  distribution  into  a  final,  unique,  stable  distribution. 
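The iteration can be checked numerically. This is a sketch under stated assumptions: the matrix entries are read from the Hebrew example above with the state order AA, AB, BA, BB; the uniform starting distribution and the cutoff of 200 steps are arbitrary choices made here, and the helper name `apply` is hypothetical.

```python
def apply(T, d):
    """Return the matrix-vector product T d."""
    return [sum(t * x for t, x in zip(row, d)) for row in T]

# Transition matrix for written Hebrew as read from the example above;
# state order AA, AB, BA, BB, with A = consonant and B = vowel.
T = [
    [0.0, 0.00, 0.23, 0.00],
    [1.0, 0.00, 0.77, 0.00],
    [0.0, 0.81, 0.00, 0.90],
    [0.0, 0.19, 0.00, 0.10],
]

d = [0.25, 0.25, 0.25, 0.25]   # arbitrary starting distribution (assumption)
for _ in range(200):           # generous, arbitrary number of steps
    d = apply(T, d)

print([round(x, 3) for x in d])
```

Starting from a different initial distribution should lead to the same limit, which agrees with the column (.095, .410, .410, .085) to within the rounding of the printed entries.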

This  extension  of  the  Markov  process  can  be  carried  as  far  as  the  data 
seem  to  merit.  For  example,  fixed-ratio  reinforcement  in  operant  conditioning 
requires  an  animal  to  respond  m  times  in  one  way,  then  approach  the  food 
tray.  In  order  to  keep  track  of  the  sequential  aspects  of  this  behavior  we 
could  define  a  state  of  the  system  to  include  all  the  possible  sequences  of 
responses and approaches of length m + 1. Thus there would be $2^{m+1}$ alternative
states, and the transformation would be of order $2^{m+1}$. More complex
sequential  dependencies  arise  in  human  verbal  behavior  and  can  be  treated 
in  a  similar  manner.  The  verbal  case  is  so  complex,  however,  that  it  cannot 
be  adequately  discussed  in  this  paper. 

In  principle  it  is  possible  to  extend  the  Markov  definition  indefinitely 
to take into account as much of the past history of the system as one desires.
Cases are known, however, in which the extension would need to be carried
infinitely  far  into  the  past  in  order  for  the  Markov  model  to  summarize  all 
the  information.  Such  cases  are  better  handled  in  other  ways.  At  present,  it 
seems  likely  that  most  learning  situations  will  need  to  be  described  by  these 
other  methods,  and  that  Markov  processes  using  a  single  matrix  of  transitional 
probabilities  are  most  valuable  when  the  behavior  has  settled  into  a  relatively 
stable  pattern. 

5. Least-Squares Fit to Data. Under the assumption that a single
transformation describes the behavior, every trial can be considered a measurement
of  the  single  transformation  T.  We  wish  to  find  a  least-squares  solution  that 
will  give  the  best  estimate  for  T  from  the  available  data.  The  following  pro- 
cedures may  not  be  the  most  efficient  for  Markov  processes,  but  they  represent 
one  fairly  natural  extension  of  the  procedures  used  with  more  familiar  statis- 
tical problems. 

We introduce a matrix M to represent the observed data. This matrix
is formed by placing in successive columns the distributions observed on
successive trials, from trial 1 through trial n - 1. If each distribution
contains $a$ alternative quantities, and n such distributions are known for
successive trials, then M is an $a \times (n - 1)$ matrix. A matrix N is formed
analogously by placing in successive columns the distributions observed on
the successive trials from 2 through n. Thus N is also an $a \times (n - 1)$ matrix.
The matrix $\hat{N}$ represents the best estimate of the successive distributions:

$$\hat{N} = N + C, \tag{19}$$

where the elements of the matrix C are the corrections that must be added
to the observed values in N to give the best estimate $\hat{N}$.

We wish to determine $\hat{T}$, the best estimate of the transformation. From
the definition of M and $\hat{N}$ and the assumption of a single operator throughout
learning, we have the equation:

$$\hat{T} M = \hat{N} = N + C. \tag{20}$$

From Eq. (20) we obtain an expression for C:

$$C = -N + \hat{T} M. \tag{21}$$

For a least-squares solution, $CC'$ must be a minimum. This is obtained by
putting the partial derivative with respect to $\hat{T}$ to zero:

$$\frac{\partial}{\partial \hat{T}}\, CC' = MC' = 0. \tag{22}$$

We now substitute for C from Eq. (21) into Eq. (22) and obtain

$$M(-N + \hat{T} M)' = -MN' + MM'\hat{T}' = 0.$$

Rearranging terms gives

$$\hat{T}' = (MM')^{-1} M N',$$

or

$$\hat{T} = N M' (MM')^{-1}. \tag{23}$$

Eq. (23) provides a best estimate of T on the basis of the data matrices M
and N.

As  an  example,  consider  an  experiment  in  a  T-maze.  We  decide  from 
an  examination  of  the  data  that  the  learning  process  can  be  described  by  a 
Markov  process  with  a  single  transformation.  Suppose  that  10  rats  were  run 
for  20  trials,  and  that  on  successive  trials  the  following  numbers  of  rats 
made  the  correct  choice:  5,  7,  6,  6,  8,  8,  8,  7,  8,  9,  8,  7,  8,  9,  10,  10,  8,  8,  9,  9. 
From  these  data  we  construct  the  matrices: 


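$$
M = \begin{pmatrix}
.5 & .7 & .6 & .6 & .8 & .8 & .8 & .7 & .8 & .9 & .8 & .7 & .8 & .9 & 1.0 & 1.0 & .8 & .8 & .9 \\
.5 & .3 & .4 & .4 & .2 & .2 & .2 & .3 & .2 & .1 & .2 & .3 & .2 & .1 & 0 & 0 & .2 & .2 & .1
\end{pmatrix}
$$

$$
N = \begin{pmatrix}
.7 & .6 & .6 & .8 & .8 & .8 & .7 & .8 & .9 & .8 & .7 & .8 & .9 & 1.0 & 1.0 & .8 & .8 & .9 & .9 \\
.3 & .4 & .4 & .2 & .2 & .2 & .3 & .2 & .1 & .2 & .3 & .2 & .1 & 0 & 0 & .2 & .2 & .1 & .1
\end{pmatrix}
$$

Next we multiply these matrices to obtain

$$
NM' = \begin{pmatrix} 12.16 & 3.14 \\ 2.74 & .96 \end{pmatrix},
\qquad
MM' = \begin{pmatrix} 11.99 & 2.91 \\ 2.91 & 1.19 \end{pmatrix}.
$$

The matrix MM' is easily inverted, and we have

$$
\hat{T} = NM'(MM')^{-1}
= \begin{pmatrix} 12.16 & 3.14 \\ 2.74 & .96 \end{pmatrix}
\begin{pmatrix} 1.19 & -2.91 \\ -2.91 & 11.99 \end{pmatrix}
\frac{1}{5.80}
= \begin{pmatrix} .92 & .39 \\ .08 & .61 \end{pmatrix}.
$$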

The initial distribution $d_0$ is (.5, .5), and from Eq. (8) we obtain

$$p^{(n)}(R) = .83 - .33(.53)^n.$$

The values calculated from this equation are .500, .655, .738, .785, .804, . . . ,
approaching .83 as the asymptote. Note that we do not have a least-squares
fit of this function, $p^{(n)}(R)$, to the observed data; we have a least-squares
fit for the transformation T.
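The arithmetic of this example can be reproduced in a few lines. The sketch below is illustrative only: it implements Eq. (23) for the two-alternative case in plain Python, and the helper names `matmul`, `transpose`, and `inv2` are introduced here for the purpose and are not part of the original paper.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(A):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Number of rats (out of 10) making the correct choice on trials 1..20.
correct = [5, 7, 6, 6, 8, 8, 8, 7, 8, 9, 8, 7, 8, 9, 10, 10, 8, 8, 9, 9]
rats = 10
cols = [[c / rats, 1 - c / rats] for c in correct]   # one distribution per trial
M = transpose(cols[:-1])   # trials 1 .. n-1, shape 2 x 19
N = transpose(cols[1:])    # trials 2 .. n,   shape 2 x 19

Mt = transpose(M)
NMt = matmul(N, Mt)
MMt = matmul(M, Mt)
T_hat = matmul(NMt, inv2(MMt))   # Eq. (23)

# For a 2 x 2 matrix whose columns sum to 1, the roots are 1 and t11 - t12.
asymptote = T_hat[0][1] / (T_hat[0][1] + T_hat[1][0])
rate = T_hat[0][0] - T_hat[0][1]

print([[round(x, 2) for x in row] for row in T_hat])
print(round(asymptote, 2), round(rate, 2))
```

The second root of the estimated matrix, printed as `rate`, is the quantity raised to the n-th power in the expression for $p^{(n)}(R)$.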

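From Eq. (21) we can calculate the corrections that are added to N:

$$
\hat{T}M = \begin{pmatrix}
.655 & .761 & .708 & .708 & .814 & .814 & .814 & .761 & .814 & .867 & .814 & .761 & .814 & .867 & .920 & .920 & .814 & .814 & .867 \\
.345 & .239 & .292 & .292 & .186 & .186 & .186 & .239 & .186 & .133 & .186 & .239 & .186 & .133 & .080 & .080 & .186 & .186 & .133
\end{pmatrix}
$$

$$
C = \begin{pmatrix}
-.045 & .161 & .108 & -.092 & .014 & .014 & .114 & -.039 & -.086 & .067 & .114 & -.039 & -.086 & -.133 & -.080 & .120 & .014 & -.086 & -.033 \\
.045 & -.161 & -.108 & .092 & -.014 & -.014 & -.114 & .039 & .086 & -.067 & -.114 & .039 & .086 & .133 & .080 & -.120 & -.014 & .086 & .033
\end{pmatrix}
$$

The squared deviations are given by

$$
CC' = \begin{pmatrix} .144 & -.144 \\ -.144 & .144 \end{pmatrix}.
$$

The best estimate of the dispersion of the calculated from the observed
values is

$$
\sigma = \sqrt{\frac{.144}{n - a - 1}} = \sqrt{\frac{.144}{17}} = .092. \tag{24}
$$

The variance-covariance matrix V is given by

$$
V = \sigma^2 (MM')^{-1} = \frac{\sigma^2}{5.80}
\begin{pmatrix} 1.19 & -2.91 \\ -2.91 & 11.99 \end{pmatrix}. \tag{25}
$$

From Eq. (25) we compute the standard deviations of the estimates of
$p_A(A)$ and $p_B(B)$:

$$
\sigma[p_A(A)] = .092 \sqrt{\frac{1.19}{5.80}} = .04,
$$

$$
\sigma[p_B(B)] = .092 \sqrt{\frac{11.99}{5.80}} = .132.
$$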
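The dispersion figures can be recomputed from the raw counts. The following sketch is an illustration under the assumptions of this section (two alternatives, a = 2, and 20 trials); the variable names are hypothetical, and the entries of MM' and NM' are formed as explicit sums rather than by matrix multiplication.

```python
import math

correct = [5, 7, 6, 6, 8, 8, 8, 7, 8, 9, 8, 7, 8, 9, 10, 10, 8, 8, 9, 9]
p = [c / 10 for c in correct]
x, y = p[:-1], p[1:]     # proportion correct on trial t and on trial t + 1
a = 2                    # number of alternatives
n = len(p)               # number of trials

# Entries of MM' and of the first row of NM' for two alternatives.
Spp = sum(u * u for u in x)
Spq = sum(u * (1 - u) for u in x)
Sqq = sum((1 - u) ** 2 for u in x)
det = Spp * Sqq - Spq ** 2
N1p = sum(v * u for u, v in zip(x, y))
N1q = sum(v * (1 - u) for u, v in zip(x, y))
t11 = (N1p * Sqq - N1q * Spq) / det     # first-row, first-column entry of T-hat
t12 = (-N1p * Spq + N1q * Spp) / det    # first-row, second-column entry of T-hat

# First row of C = (T-hat)M - N; the second row is its negative.
c_row = [t11 * u + t12 * (1 - u) - v for u, v in zip(x, y)]
ss = sum(e * e for e in c_row)

sigma = math.sqrt(ss / (n - a - 1))         # Eq. (24)
sd_first = sigma * math.sqrt(Sqq / det)     # uses the 1.19 / 5.80 diagonal entry
sd_second = sigma * math.sqrt(Spp / det)    # uses the 11.99 / 5.80 diagonal entry

print(round(ss, 3), round(sigma, 3), round(sd_first, 3), round(sd_second, 3))
```

Small differences from the printed values are to be expected, since the text rounds the estimated matrix to two decimals before forming the corrections.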

The same procedure can be applied to the data from a single animal.
The data matrices M and N then have either 0 or 1 on successive trials; e.g.,

$$
M = \begin{pmatrix}
1 & 0 & 1 & 1 & 1 & 0 & 0 & 1 & 0 & 1 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0
\end{pmatrix},
$$

$$
N = \begin{pmatrix}
0 & 1 & 1 & 1 & 0 & 0 & 1 & 0 & 1 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 1
\end{pmatrix}.
$$

In order to solve for $\hat{T}$ we determine

$$
NM' = \begin{pmatrix} m(1,1) & m(0,1) \\ m(1,0) & m(0,0) \end{pmatrix},
\qquad
MM' = \begin{pmatrix} m(1) & 0 \\ 0 & m(0) \end{pmatrix}.
$$

The symbol m(i,j) represents the number of occurrences of the ordered pair
i, j; m(i) represents the number of occurrences of i; and m(0) + m(1) =
n - 1, where n is the number of trials. Next we invert MM' and solve for $\hat{T}$:

$$
\hat{T} = NM'(MM')^{-1}
= \begin{pmatrix} m(1,1) & m(0,1) \\ m(1,0) & m(0,0) \end{pmatrix}
\begin{pmatrix} \dfrac{1}{m(1)} & 0 \\ 0 & \dfrac{1}{m(0)} \end{pmatrix}
= \begin{pmatrix}
\dfrac{m(1,1)}{m(1)} & \dfrac{m(0,1)}{m(0)} \\[2ex]
\dfrac{m(1,0)}{m(1)} & \dfrac{m(0,0)}{m(0)}
\end{pmatrix}. \tag{26}
$$

Eq. (26) is the result that would be expected from the definition of the
transitional probabilities.
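For a single animal the estimate reduces to counting ordered pairs. The sketch below is a minimal illustration; the function name `transition_estimate` is hypothetical, the 12-trial record is the one implied by the example matrices M and N, and the code assumes that both responses occur at least once among the first n - 1 trials (otherwise a division by zero would occur).

```python
from collections import Counter

def transition_estimate(trials):
    """Return the matrix of Eq. (26) for a 0/1 record.

    The first column describes what follows a 1, the second what follows a 0.
    """
    m = Counter(zip(trials, trials[1:]))   # m[(i, j)]: i immediately followed by j
    m1 = m[(1, 1)] + m[(1, 0)]             # m(1), counted over trials 1 .. n-1
    m0 = m[(0, 1)] + m[(0, 0)]             # m(0)
    return [[m[(1, 1)] / m1, m[(0, 1)] / m0],
            [m[(1, 0)] / m1, m[(0, 0)] / m0]]

trials = [1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
T_hat = transition_estimate(trials)
print(T_hat)
```

Each column of the result sums to 1, as it must for a matrix of transitional probabilities.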

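In order to estimate the dispersion we calculate

$$
\hat{T}M = \begin{pmatrix}
\dfrac{m(1,1)}{m(1)} & \dfrac{m(0,1)}{m(0)} & \dfrac{m(1,1)}{m(1)} & \cdots & \dfrac{m(1,1)}{m(1)} \\[2ex]
\dfrac{m(1,0)}{m(1)} & \dfrac{m(0,0)}{m(0)} & \dfrac{m(1,0)}{m(1)} & \cdots & \dfrac{m(1,0)}{m(1)}
\end{pmatrix}.
$$

Then from Eq. (21) we find

$$
C = \hat{T}M - N = \begin{pmatrix}
\dfrac{m(1,1)}{m(1)} & -\dfrac{m(0,0)}{m(0)} & -\dfrac{m(1,0)}{m(1)} & \cdots & -\dfrac{m(1,0)}{m(1)} \\[2ex]
-\dfrac{m(1,1)}{m(1)} & \dfrac{m(0,0)}{m(0)} & \dfrac{m(1,0)}{m(1)} & \cdots & \dfrac{m(1,0)}{m(1)}
\end{pmatrix}.
$$

The squared deviations are given by

$$
CC' = \begin{pmatrix} c & -c \\ -c & c \end{pmatrix},
$$

where

$$
\begin{aligned}
c &= m(1,1)\left[\frac{m(1,0)}{m(1)}\right]^2 + m(1,0)\left[\frac{m(1,1)}{m(1)}\right]^2
   + m(0,1)\left[\frac{m(0,0)}{m(0)}\right]^2 + m(0,0)\left[\frac{m(0,1)}{m(0)}\right]^2 \\
  &= [m(1,0) + m(1,1)]\left[\frac{m(1,1)}{m(1)} \cdot \frac{m(1,0)}{m(1)}\right]
   + [m(0,0) + m(0,1)]\left[\frac{m(0,1)}{m(0)} \cdot \frac{m(0,0)}{m(0)}\right] \\
  &= m(1)\left[\frac{m(1,1)}{m(1)} \cdot \frac{m(1,0)}{m(1)}\right]
   + m(0)\left[\frac{m(0,1)}{m(0)} \cdot \frac{m(0,0)}{m(0)}\right].
\end{aligned}
$$

The dispersion is, therefore,

$$
\sigma = \sqrt{\frac{c}{n - 1 - a}}
= \left[\frac{m(1)}{n - 3} \cdot \frac{m(1,1)}{m(1)} \cdot \frac{m(1,0)}{m(1)}
+ \frac{m(0)}{n - 3} \cdot \frac{m(0,1)}{m(0)} \cdot \frac{m(0,0)}{m(0)}\right]^{1/2}. \tag{27}
$$

The variance-covariance matrix is

$$
V = \sigma^2 (MM')^{-1} = \sigma^2
\begin{pmatrix} \dfrac{1}{m(1)} & 0 \\ 0 & \dfrac{1}{m(0)} \end{pmatrix},
$$

and from this matrix we compute

$$
\sigma[p_A(A)] = \sigma \sqrt{\frac{1}{m(1)}}
\qquad \text{and} \qquad
\sigma[p_B(B)] = \sigma \sqrt{\frac{1}{m(0)}}. \tag{28}
$$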
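The closed form for c can be checked against a direct sum of squared corrections. The sketch below is illustrative: it uses the same 12-trial record as the example matrices, assumes two alternatives (a = 2), and the variable names are introduced here only for the purpose.

```python
import math
from collections import Counter

trials = [1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
n, a = len(trials), 2
m = Counter(zip(trials, trials[1:]))    # ordered-pair counts m(i, j)
m1 = m[(1, 1)] + m[(1, 0)]              # m(1)
m0 = m[(0, 1)] + m[(0, 0)]              # m(0)

# Closed form for c, the common magnitude of the entries of CC'.
c = (m1 * (m[(1, 1)] / m1) * (m[(1, 0)] / m1)
     + m0 * (m[(0, 1)] / m0) * (m[(0, 0)] / m0))

# Direct check: squared first-row corrections, (estimated p of a 1) - (observed).
p_one_after = {1: m[(1, 1)] / m1, 0: m[(0, 1)] / m0}
c_direct = sum((p_one_after[i] - j) ** 2 for i, j in zip(trials, trials[1:]))

sigma = math.sqrt(c / (n - 1 - a))      # Eq. (27)
sd_after_1 = sigma / math.sqrt(m1)      # Eq. (28), column estimated from m(1) cases
sd_after_0 = sigma / math.sqrt(m0)      # Eq. (28), column estimated from m(0) cases

print(c, c_direct, sigma, sd_after_1, sd_after_0)
```

With so short a record the standard deviations are large, which illustrates why single-animal estimates need many trials.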


Although  these  examples  are  worked  out  for  the  Markov  case  with  two 
alternatives,  the  same  procedures  can  be  used  with  more  than  two  alternatives 
or  with  Markov  processes  defined  for  compound  responses.  It  should  be 
stressed,  however,  that  the  statistical  properties  of  Markov  chains  are  neither 
simple  nor  well  understood.  Better  techniques  will  undoubtedly  develop  as 
the  Markov  process  becomes  more  widely  applied. 

6.  Variable  Transformations.  Up  to  this  point  we  have  made  the  ex- 
plicit assumption  that  a  single  transformation  could  describe  the  successive 
changes  in  the  probabilities  of  the  alternative  responses  or  alternative  se- 
quences of  responses.  This  assumption  greatly  simplifies  the  theoretical 
landscape  and  should  be  made  whenever  the  data  hint  that  it  might  be  true. 
Simplicity  is  not,  however,  an  intrinsic  property  of  the  behavior  of  living 
organisms,  and  so  we  must  be  prepared  to  deal  with  situations  that  obviously 
violate the assumption.

The assumption that a single transformation is adequate means that the
transitional  probabilities  are  fixed  from  the  first  through  the  last  trial.  Since 
the  transitional  probabilities  determine  the  sequences  of  responses  that  are 
probable  or  improbable,  we  are  assuming  that  the  animal's  course  of  action 
or  strategy  is  fixed  throughout  the  experiment.  In  a  certain  sense,  therefore, 
such  an  assumption  means  that  there  is  no  learning  at  all;  as  soon  as  the 
experimental  situation  is  encountered  for  the  first  time,  the  subject  adopts 
the  set  of  transitional  probabilities  that  will  later  describe  the  statistical 
properties  of  his  behavior  after  he  has  had  long  experience  in  the  situation. 

The  assumption  of  a  single  transformation  would  be  justified,  for  ex- 
ample, after  a  long  series  of  alternate  conditioning  and  extinction.  In  this 
experiment  the  subject  is  able  to  evolve  a  single  transformation  for  the  re- 
inforcement conditions  and  another  for  the  extinction  conditions.  Or  if  an 
animal  has  adopted  a  stable  mode  of  behavior  in  a  situation  and  then  is 
temporarily  distracted  in  some  way,  his  return  to  normal  when  the  im- 
pediment is  removed  might  be  expected  to  follow  a  single  transformation. 
But  in  most  of  the  situations  that  are  studied  experimentally  there  is  no  a 
priori  reason  to  expect  that  a  single  transformation  will  be  adequate,  and 
there  are  several  reasons  to  expect  that  it  will  not  be. 

In  order  to  illustrate  what  is  involved  in  the  assumption  of  a  single 
transformation, Table 1 has been prepared to show one case where the
assumption is correct and another where the assumption is wrong. Once more
we consider the data from 10 rats on 20 consecutive choices in a T-maze.
The symbol 1 represents a correct choice, and 0 represents an incorrect choice.
In Tables 1A and 1B the numbers of rats making the correct choice are the
same, and both are the same as the example fitted in the preceding section.


TABLE 1
Hypothetical Data for Ten Rats on Twenty Trials in a T-Maze

1A. Constant Transformation

Trial   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Σ       5  7  6  6  8  8  8  7  8  9  8  7  8  9 10 10  8  8  9  9

[The rows for individual rats are incomplete in this copy. The entries that
survive for each rat are listed in their original order; they are not aligned
with trial numbers.]

Rat 1:   1 1 0 0 0 0 0 0 1 1 1 1
Rat 2:   0 0 0 0 0 1 1 1 0 1
Rat 3:   1 1 1 1 1 1 1 1 0 0 0
Rat 4:   0 1 1 1 1 0 0 0 1 1
Rat 5:   1 1 1 1 1 1 1 0 0
Rat 6:   0 1 1 1 1 1 1 1 1
Rat 7:   0 0 0 0 1 1 1 1 1
Rat 8:   1 1 1 1 0 0 0 0 0 0 0 1 1 1
Rat 9:   0 0 0 0 1 1 0 0 1 1
Rat 10:  1 1 1 1 0 0 1 1 1 1

TABLE 1 (Continued)

1B. Variable Transformation

Trial   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Σ       5  7  6  6  8  8  8  7  8  9  8  7  8  9 10 10  8  8  9  9

[The rows for individual rats are incomplete in this copy. The entries that
survive for each rat are listed in their original order, with ? marking an
illegible character; they are not aligned with trial numbers.]

Rat 11:  1 0 1 1 1 1 1 0 1 1 1 1 ? 1 1
Rat 12:  0 0 0 0 1 1 1 1 0 1 1 0 1
Rat 13:  1 1 1 0 0 1 0 0 ? 0
Rat 14:  1 1 1 1 1 1 0 1 ? 1
Rat 15:  1 0 1 1 0 0 1 1 ? 0 0
Rat 16:  0 0 1 1 1 1 1 1 1 ? 1
Rat 17:  0 1 0 1 1 1 0 1 1 ? 1
Rat 18:  0 0 0 1 1 0 1 1 ? 1
Rat 19:  1 0 0 0 1 0 1 0 1 1 ? 1
Rat 20:  0 1 1 1 1 1 1 1 0 0 1 1 ? 1


From the data in Table 1 we can estimate the values of $p_1(1)$ and $p_0(0)$
on successive pairs of trials by $m(i,j)/m(i)$:

1A Trial   p1(1)   p0(0)      1B Trial   p1(1)   p0(0)
1-2        1.00    0.60       1-2        0.60    0.20
2-3        0.86    1.00       2-3        0.72    0.67
3-4        1.00    1.00       3-4        0.60    0.50
4-5        1.00    0.25       4-5        0.83    0.25
5-6        0.88    0.50       5-6        0.75    0.00
6-7        1.00    1.00       6-7        0.88    0.50
7-8        0.88    1.00       7-8        0.75    0.60
8-9        1.00    0.67       8-9        0.72    0.00
9-10       1.00    0.50       9-10       0.88    0.50
10-11      0.89    1.00       10-11      0.78    0.00
11-12      0.88    1.00       11-12      0.75    0.50
12-13      1.00    0.67       12-13      0.86    0.33
13-14      1.00    0.50       13-14      1.00    0.50
14-15      1.00    0.00       14-15      1.00    0.00
15-16      1.00               15-16      1.00
16-17      0.80               16-17      0.80
17-18      0.88    0.50       17-18      0.88    0.50
18-19      1.00    0.50       18-19      1.00    0.50
19-20      1.00    1.00       19-20      1.00    1.00

There seems to be a clear trend in 1B for $p_1(1)$ to increase on successive
trials, whereas no trend for $p_1(1)$ is observable in 1A. If we group the trials
by fives to secure more reliable estimates, we get

1A Trials   p1(1)   p0(0)      1B Trials   p1(1)   p0(0)
1-6         0.94    0.67       1-6         0.72    0.33
6-11        0.95    0.89       6-11        0.85    0.30
11-16       0.98    0.63       11-16       0.93    0.38
16-20       0.92    0.60       16-20       0.92    0.60

Comparisons such as these show that the assumption of a constant
transformation cannot be checked by the successive distributions alone, for
1A and 1B are identical in this respect. The assumption is justified if the
analysis of short sequences of trials shows relatively constant transitional
frequencies, as in 1A. If the transitional frequencies show a definite trend,
as in 1B, the assumption is not justified.
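The check described here amounts to pooling ordered pairs over animals within a window of trials. The sketch below shows one way to do this; the function name and the small data array are hypothetical and are not the entries of Table 1.

```python
def transitional_frequencies(data, start, stop):
    """Pool all animals over trial pairs (t, t + 1) with start <= t < stop (0-based).

    Returns (p11, p00): the proportion of 1 -> 1 moves among pairs starting in 1,
    and of 0 -> 0 moves among pairs starting in 0. None means that the starting
    response never occurred in the window.
    """
    m = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for row in data:
        for t in range(start, stop):
            m[(row[t], row[t + 1])] += 1
    m1 = m[(1, 1)] + m[(1, 0)]
    m0 = m[(0, 0)] + m[(0, 1)]
    p11 = m[(1, 1)] / m1 if m1 else None
    p00 = m[(0, 0)] / m0 if m0 else None
    return p11, p00

# Hypothetical records (rows = animals, columns = trials); NOT the data of Table 1.
data = [
    [1, 1, 1, 0, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 1, 1],
]
for t in range(len(data[0]) - 1):
    print(f"trials {t + 1}-{t + 2}:", transitional_frequencies(data, t, t + 1))
print("pooled:", transitional_frequencies(data, 0, 5))
```

Widening the window from a single pair of trials to several corresponds to the grouping by fives used above to obtain more reliable estimates.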

The  question  is  what  to  do  when  we  face  variable  transformations. 
Whatever we do, the situation will not be simple. If $\ldots PQRST\,d_0$ cannot
be translated into $\ldots TTTTT\,d_0$, the matrix products may get quite
complex. If we could choose P, Q, R, S, T as commutative matrices, it would be
possible  to  find  a  simultaneous  solution  for  all  of  them;  all  matrices  would 
have  the  same  characteristic  vectors  but  different  characteristic  roots.  Un- 
fortunately, however,  it  does  not  seem  possible  in  general  to  choose  com- 
mutative matrices  with  the  properties  demanded  by  the  data. 

If  the  complexity  of  the  problem  is  admitted  as  inevitable,  we  can  still 
look  for  a  matrix  function  of  n,  T(n),  that  changes  in  some  reasonable  way 
on  successive  trials.  The  following  argument  illustrates  one  possible  approach. 
We  assume  that  at  the  beginning  of  the  experiment  the  subjects  are  equipped 
with  transitional  preferences  given  by  the  matrix  U.  After  long  experience 
in  the  situation  the  subjects  develop  transitional  preferences  given  by  the 
matrix  V.  As  the  experiment  progresses  the  tendencies  represented  by  U  are 
slowly  extinguished  and  those  represented  by  V  are  slowly  strengthened. 
Consider  the  following  sequence  of  equations: 

$$
\begin{aligned}
T(0) &= U \\
T(1) &= wT(0) + (1 - w)V \\
T(2) &= wT(1) + (1 - w)V \\
&\;\;\vdots \\
T(n) &= wT(n - 1) + (1 - w)V,
\end{aligned}
\tag{29}
$$

where  0  <  w  <  1.  The  rationale  for  this  set  of  equations  is  that  w  represents 
the  perseveration  of  the  tendencies  on  the  preceding  trial,  and  (1  —  w) 
represents  the  ability  to  adopt  the  new  mode  of  response  symbolized  by  V. 
If  the  extinction  of  the  old  pattern  of  responses  is  slow,  w  is  near  unity;  if 
the  old  pattern  extinguishes  rapidly,  w  is  near  zero. 
Eq.  (29)  can  be  written  in  terms  of  U  and  V: 

$$
\begin{aligned}
T(0) &= U &&= w^0(U - V) + V \\
T(1) &= wU + (1 - w)V &&= w^1(U - V) + V \\
T(2) &= w^2U + (1 - w^2)V &&= w^2(U - V) + V \\
&\;\;\vdots \\
T(n) &= w^nU + (1 - w^n)V &&= w^n(U - V) + V.
\end{aligned}
\tag{30}
$$

In this form it is clear that, since 0 < w < 1, T(n) approaches V as n
increases. The importance of U becomes progressively smaller as the subject
has  more  and  more  experience  in  the  experimental  situation.  This  formulation 
has  the  advantage  that  it  is  relatively  easy  to  compute  the  successive  values 
of  T{n),  given  U  and  V.  The  initial  and  final  matrices,  U  and  V,  can  be  given 
theoretically  or  can  be  determined  from  data  obtained  prior  to  the  first  trial 
and  after  the  learned  behavior  has  stabilized  again  in  the  new  course  of  action. 
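For illustrative purposes, assume that U and V are known to be

$$
U = \begin{pmatrix} .5 & .5 \\ .5 & .5 \end{pmatrix}
\qquad \text{and} \qquad
V = \begin{pmatrix} .9 & .4 \\ .1 & .6 \end{pmatrix},
$$

and that the weight w is calculated to be 0.8. Then Eq. (30) gives

$$
T(n) = (.8)^n \begin{pmatrix} -.4 & .1 \\ .4 & -.1 \end{pmatrix}
+ \begin{pmatrix} .9 & .4 \\ .1 & .6 \end{pmatrix}.
$$

Then on successive learning trials we have:

n:        0    1     2     3     4     5     6     7     8     9     10    ...
p_A(A):   .5   .58   .644  .695  .736  .768  .796  .816  .832  .846  .857  ...
p_B(B):   .5   .52   .536  .549  .559  .567  .574  .579  .583  .587  .589  ...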
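The successive matrices can be generated directly from the closed form. The sketch below is illustrative: it takes U with every entry .5 and V with columns (.9, .1) and (.4, .6), values that are consistent with the tabulated diagonal entries, and w = 0.8. The function name `T` mirrors the notation of Eq. (30) but is otherwise an assumption of this sketch. The recursion of Eq. (29) is run alongside as a cross-check.

```python
def T(n, U, V, w):
    """Closed form of Eq. (30): T(n) = w**n (U - V) + V, element by element."""
    return [[w ** n * (u - v) + v for u, v in zip(ru, rv)] for ru, rv in zip(U, V)]

U = [[0.5, 0.5], [0.5, 0.5]]
V = [[0.9, 0.4], [0.1, 0.6]]
w = 0.8

Tn = U
for n in range(1, 11):
    # One step of the recursion in Eq. (29).
    Tn = [[w * t + (1 - w) * v for t, v in zip(rt, rv)] for rt, rv in zip(Tn, V)]
    closed = T(n, U, V, w)
    # The recursion and the closed form should agree at every step.
    assert all(abs(x - y) < 1e-12
               for rx, ry in zip(Tn, closed) for x, y in zip(rx, ry))
    print(n, round(closed[0][0], 3), round(closed[1][1], 3))
```

The two printed columns are the diagonal entries of T(n), which correspond to the rows labeled p_A(A) and p_B(B) in the table above, up to rounding in the last digit.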

Next  we  calculate  the  proportions  of  right  and  wrong  responses  on  successive 
trials.  This  is  given  by  the  equation : 

$$
\begin{aligned}
T(0)d_0 &= d_1 \\
T(1)d_1 &= d_2 = T(1)T(0)d_0 \\
T(2)d_2 &= d_3 = T(2)T(1)T(0)d_0 \\
&\;\;\vdots \\
T(n)d_n &= d_{n+1} = \prod_{i=0}^{n} T(i)\, d_0.
\end{aligned}
\tag{31}
$$

It is assumed that T(0) = U and $d_0$ are known from preliminary
experimentation. Assume the boundary condition $d_0' = (.5, .5)$. Then direct
computation gives the values:

n:      1    2     3     4     5     6     7     8     9     10    ...   ∞
p(R):   .5   .53   .559  .587  .614  .639  .662  .683  .700  .716  ...   .800

Considerable  care  must  be  taken  with  such  iterated  computation,  for  the 
errors  are  cumulative. 
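The iterated computation is easy to mechanize, which also limits the accumulation of rounding error. The sketch below is an illustration under the same assumptions as the numerical example: U with every entry .5, V with columns (.9, .1) and (.4, .6), w = 0.8, and an initial distribution of (.5, .5). The helper names `T` and `apply` are introduced here for the purpose.

```python
def T(n, U, V, w):
    """T(n) = w**n (U - V) + V, as in Eq. (30)."""
    return [[w ** n * (u - v) + v for u, v in zip(ru, rv)] for ru, rv in zip(U, V)]

def apply(A, d):
    """Matrix-vector product A d."""
    return [sum(x * y for x, y in zip(row, d)) for row in A]

U = [[0.5, 0.5], [0.5, 0.5]]
V = [[0.9, 0.4], [0.1, 0.6]]
w = 0.8

d = [0.5, 0.5]      # d_0
p_R = []            # p_R[k] is the first component of d_{k+1}
for n in range(10):
    d = apply(T(n, U, V, w), d)     # d_{n+1} = T(n) d_n, Eq. (31)
    p_R.append(d[0])

print([round(x, 3) for x in p_R])
```

Carrying full floating-point precision through every step, and rounding only for display, avoids the cumulative error that hand computation with three-figure intermediate values can introduce.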

It should be noted that if w = 0, the variable case reduces to the constant
case, for then T(n) = V and $\prod T(i) = T^n$. Similarly, if w = 1, then T(n) = U
and we again have a single transformation.

A  special  case  arises  if  U  and  V  commute,  UV  =  VU,  for  then  T(n) 
and  T{n  +  k)  also  commute.  If  two  matrices  with  distinct  roots  commute, 
then  one  can  be  written  as  a  polynomial  in  terms  of  the  other,  with  scalar 
coefficients.  Thus  if  the  matrices  A  and  B  commute,  we  can  write,  according 
to  Eq.  (15)  and  (16), 

$$
\begin{aligned}
B &= \lambda_1 f_1(B) + \lambda_2 f_2(B) + \cdots + \lambda_N f_N(B) \\
A &= g(B) = g(\lambda_1) f_1(B) + g(\lambda_2) f_2(B) + \cdots + g(\lambda_N) f_N(B),
\end{aligned}
\tag{32}
$$

where $\lambda_i$ is the characteristic root of B; $g(\lambda_i)$ is the characteristic root of A;
and for matrices of transitional probabilities $\lambda_1 = g(\lambda_1) = 1$. Thus A and
B have different roots, but $f_i(A) = f_i(B)$. Another way of saying the same
thing is to note that commutative matrices are transformed into their diagonal
form by the same operator. Thus if S transforms A into the diagonal form
$\Lambda_A$, S also transforms B into its diagonal form $\Lambda_B$. The product of A and
B is (since the diagonal matrices $\Lambda_A$ and $\Lambda_B$ obviously commute)

$$
AB = (S\Lambda_A S^{-1})(S\Lambda_B S^{-1}) = S\Lambda_A\Lambda_B S^{-1} = S\Lambda_B\Lambda_A S^{-1}
= (S\Lambda_B S^{-1})(S\Lambda_A S^{-1}) = BA.
$$

If the matrices T(i) commute, then

$$
\prod_i T(i) = S\Big[\prod_i \Lambda(i)\Big]S^{-1}, \tag{33}
$$

where the $\Lambda(i)$ are the diagonal matrices similar to T(i). The product of the
T(i) reduces to the product of diagonal matrices. If all of the $\Lambda(i)$'s are
equal, then Eq. (33) reduces to the constant case given by Eq. (5).

Commutative  matrices  occur  when  the  distribution  over  the  several 
alternative  responses  does  not  change,  although  the  transitional  probabilities 
do change. If U has been applied repeatedly, $U^n$ approaches $f_1(U)$ as a limit;
after V has been applied repeatedly, $V^n$ approaches $f_1(V)$. When U and V
commute, $f_1(U) = f_1(V)$, and so both transformations lead to the same
stable  distribution.  Such  a  situation  might  arise  in  learning  a  simple  alterna- 
tion between  left  and  right.  The  learning  might  leave  p(L)  =  p{R)  =  .5, 
although  the  transitional  probabilities  were  altered. 
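A small numerical case may make this concrete. The matrices below are hypothetical and chosen only for illustration: U describes responding with no sequential dependence, and V describes a strong tendency to alternate. They commute, and repeated application of either one leads to the same stable distribution, p(L) = p(R) = .5.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def power(A, n):
    """A multiplied by itself n times (2 x 2 case)."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        result = matmul(result, A)
    return result

U = [[0.5, 0.5], [0.5, 0.5]]    # hypothetical: no sequential dependence
V = [[0.1, 0.9], [0.9, 0.1]]    # hypothetical: strong left-right alternation

commute = all(abs(x - y) < 1e-12
              for rx, ry in zip(matmul(U, V), matmul(V, U))
              for x, y in zip(rx, ry))
U_limit = power(U, 200)         # 200 is an arbitrary, generous number of steps
V_limit = power(V, 200)

print(commute)
print(U_limit)
print(V_limit)
```

Both limits have every entry equal to one half, so any initial distribution is carried to (.5, .5) by either transformation, even though the transitional probabilities in U and V are very different.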

This discussion of learning should suggest some of the descriptive
possibilities of systems of dependent probabilities. By this general development
we  arrived  at  a  mathematical  description  of  complex  behavioral  changes — 
a  description  that  enables  us  to  talk  about  the  gradual  replacement  of  one 
pattern of responses by another.


ON THE MAXIMUM LIKELIHOOD ESTIMATE OF THE
SHANNON-WIENER  MEASURE  OF  INFORMATION 

George  A.  Miller 

AND 

William  G.  Madow 

The  limiting  form  and  the  first  two  asymptotic  moments  of  the  sampling  distri- 
bution of  the  maximum  likelihood  estimate  of  the  Shannon-Wiener  measure  of  amount 
of  information  per  observation  drawn  from  a  multinomial  distribution  are  determined. 
Also,  approximations  to  the  bias  and  the  mean  square  error  of  the  estimate  are  given. 

Preface 

The  statistic  defined  by  Shannon  (3)  and  by  Wiener  (4)  to  measure  the  amount  of 
information  in  an  event  drawn  from  a  multinomial  distribution  has  been  adopted  by 
some  psychologists  to  measure  certain  aspects  of  stimulus  and  response  events  in 
psychological  experiments  (2).  In  these  applications,  however,  the  psychologist  is 
usually  forced  to  work  with  relatively  small  samples  and  the  sampling  distribution  of  the 
measure  becomes  of  real  interest.  In  the  present  paper  the  first  two  moments  of  the 
asymptotic  distribution  are  derived  and  the  bias  of  the  statistic  for  small  samples  is 
explored. 

1 .  The  Limiting  Distribution  of  the  Maximum  Likelihood  Estimate 
of  Amount  of  Information 

If an experiment or operation has k possible results, the i-th of which has
probability $p_i > 0$, $i = 1, \ldots, k$, the Shannon-Wiener measure of the amount of
information per performance of this operation or event is

$$H = -\sum_{i=1}^{k} p_i \log_2 p_i.$$

We propose to consider the properties of the maximum likelihood estimate
H' of H obtained from n independent performances of the operation. Since H is a
continuous and differentiable function of $p_1, \ldots, p_k$ for all positive values of the
probabilities, it follows that the maximum likelihood estimate, H', is

$$H' = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n},$$

where $n_i$ is the frequency with which the i-th of the k possible outcomes occurs in the n
performances, and where, if $n_i = 0$ for one or more values of i, we define the
corresponding terms $(n_i/n) \log_2 (n_i/n)$ of H' to be 0.

This article is from the Operational Applications Laboratory, Air Force Cambridge
Research Center, Air Research and Development Command, Bolling Air Force Base, 1954,
AFCRC-TR-54-75, contract AF 18(600)-322. Reprinted with permission.

We will now show: (a) If the $p_i$ are not all equal, then H' has a normal limiting
distribution; and (b) if $p_i = 1/k$, $i = 1, \ldots, k$, then H' has a chi-square limiting
distribution with k - 1 degrees of freedom.

As  a  preliminary,  we  obtain  H  —  H'  in  a  form  that  simplifies  the  further 
calculations. 

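Lemma. The difference H - H' is given by the following equations:
Let

$$U_n = \sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{np_i}$$

and

$$V_n = \sum_{i=1}^{k} \left(\frac{n_i}{n} - p_i\right) \log_2 p_i. \tag{1}$$

Then

$$H - H' = U_n + V_n$$

where $p_i > 0$, $i = 1, \ldots, k$. Terms in $U_n$ that have $n_i = 0$ are themselves defined to
vanish, but terms in $V_n$ that have $n_i = 0$ still yield $-p_i \log_2 p_i$.

Proof. By simple substitutions we can expand H' as follows:

$$
\begin{aligned}
H' &= -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{np_i} - \sum_{i=1}^{k} \frac{n_i}{n} \log_2 p_i \\
   &= -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{np_i}
      - \sum_{i=1}^{k} \left(\frac{n_i}{n} - p_i\right) \log_2 p_i
      - \sum_{i=1}^{k} p_i \log_2 p_i.
\end{aligned}
$$

All we need to do is verify that the effects of $n_i = 0$ are as stated. Suppose, for example,
that $n_1 = 0$ but $n_i > 0$ otherwise. Then from the definitions of H and H' we have

$$H - H' = -\sum_{i=1}^{k} p_i \log_2 p_i + \sum_{i=2}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$$

and

$$U_n = \sum_{i=2}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{np_i}
= \sum_{i=2}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n} - \sum_{i=2}^{k} \frac{n_i}{n} \log_2 p_i,$$

$$V_n = \sum_{i=1}^{k} \frac{n_i}{n} \log_2 p_i + H,$$

so that if we combine the values of $U_n$ and $V_n$, we verify that $H - H' = U_n + V_n$.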
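The decomposition can be checked numerically, including the convention for empty categories. The sketch below is illustrative; the probabilities and counts are hypothetical, and the function name `entropy_terms` is introduced here for the purpose.

```python
import math

def entropy_terms(counts, p):
    """Return (H, H_prime, U_n, V_n) in bits for observed counts and true p."""
    n = sum(counts)
    H = -sum(pi * math.log2(pi) for pi in p)
    # Terms with n_i = 0 are defined to vanish in H' and in U_n ...
    H_prime = -sum(c / n * math.log2(c / n) for c in counts if c > 0)
    U = sum(c / n * math.log2(c / (n * pi)) for c, pi in zip(counts, p) if c > 0)
    # ... but every term is kept in V_n, where n_i = 0 contributes -p_i log2 p_i.
    V = sum((c / n - pi) * math.log2(pi) for c, pi in zip(counts, p))
    return H, H_prime, U, V

p = [0.5, 0.25, 0.125, 0.125]   # hypothetical true probabilities
counts = [0, 7, 2, 3]           # hypothetical sample in which n_1 = 0
H, Hp, U, V = entropy_terms(counts, p)
print(H - Hp, U + V)
```

The two printed numbers should agree to within floating-point error, even though the first category was never observed.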

Theorem  1. 

a. If the $p_i$ are not all equal, then $\sqrt{n}\,(H - H')$ has a normal limiting distribution
with mean 0 and variance

$$\sigma^2 = \sum_{i=1}^{k} p_i (\log_2 p_i + H)^2.$$

b. If $p_i = 1/k$, $i = 1, \ldots, k$, then $(2n/\log_2 e)(H - H')$ has a chi-square limiting
distribution with k - 1 degrees of freedom.

The first part of Theorem 1 holds for maximum likelihood estimates almost
without exception (e.g. [1], p. 500). Also, maximum likelihood estimates are
asymptotically efficient. We will prove both parts of the theorem since most of the calculations
made would be needed in any case for the asymptotic moments. Because of the
preceding lemma, the problem of evaluating $\sqrt{n}\,(H - H')$ can be replaced by the
equivalent problem of evaluating $\sqrt{n}\,U_n + \sqrt{n}\,V_n$.

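Proof. We first note that if the $p_i$ are not all equal then $\sqrt{n}\,V_n$ has a normal
limiting distribution with mean 0 and variance $\sigma^2$. We sketch the proof: The random
variables $\sqrt{n}\,(n_i/n - p_i)$, $i = 1, \ldots, k - 1$, have a (k - 1)-variate limiting normal
distribution with mean values 0, variances $p_i q_i$ ($q_i = 1 - p_i$), and covariances $-p_i p_j$,
$i, j = 1, \ldots, k - 1$ ($i \neq j$). Since the $\log p_i$ are constant weights applied to these
random variables, it is clear that $\sqrt{n}\,V_n$ is a linear combination of the random variables.
Therefore, $\sqrt{n}\,V_n$ has a limiting normal distribution with mean value

$$
\sqrt{n}\,E V_n = \sqrt{n}\,E \sum_{i=1}^{k} \left(\frac{n_i}{n} - p_i\right) \log_2 p_i
= \sqrt{n} \sum_{i=1}^{k} \log_2 p_i \, E\left(\frac{n_i}{n} - p_i\right) = 0,
$$

and a variance

$$
\begin{aligned}
\sigma^2 &= \operatorname{Var}\left[\sqrt{n} \sum_{i=1}^{k} \left(\frac{n_i}{n} - p_i\right) \log_2 p_i\right] \\
&= \sum_{i=1}^{k} (\log_2 p_i)^2 \operatorname{Var}\left[\sqrt{n}\left(\frac{n_i}{n} - p_i\right)\right]
 + \sum_{i \neq j} (\log_2 p_i)(\log_2 p_j) \operatorname{Cov}\left[\sqrt{n}\left(\frac{n_i}{n} - p_i\right), \sqrt{n}\left(\frac{n_j}{n} - p_j\right)\right] \\
&= \sum_{i=1}^{k} (\log_2 p_i)^2 p_i q_i - \sum_{i \neq j} (\log_2 p_i)(\log_2 p_j) p_i p_j \\
&= \sum_{i=1}^{k} p_i (\log_2 p_i)^2 - \sum_{i=1}^{k} \sum_{j=1}^{k} (p_i \log_2 p_i)(p_j \log_2 p_j) \\
&= \sum_{i=1}^{k} p_i (\log_2 p_i)^2 - H^2 \\
&= \sum_{i=1}^{k} p_i (\log_2 p_i + H)^2.
\end{aligned}
$$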
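The two closing forms of the variance can be compared numerically. The sketch below is illustrative, with hypothetical probabilities and a function name introduced here. It also shows why the case of equal probabilities needs separate treatment: when every $p_i = 1/k$, each $\log_2 p_i$ equals $-H$, so the variance of the normal limit is zero.

```python
import math

def asymptotic_variance(p):
    """Return the variance of the normal limit computed in two equivalent ways."""
    H = -sum(pi * math.log2(pi) for pi in p)
    form_a = sum(pi * math.log2(pi) ** 2 for pi in p) - H ** 2
    form_b = sum(pi * (math.log2(pi) + H) ** 2 for pi in p)
    return form_a, form_b

unequal = [0.5, 0.25, 0.125, 0.125]   # hypothetical unequal probabilities
uniform = [0.25, 0.25, 0.25, 0.25]    # all p_i equal to 1/k with k = 4

print(asymptotic_variance(unequal))
print(asymptotic_variance(uniform))
```

For the unequal case both forms give the same positive value; for the uniform case both vanish, which is the degenerate situation handled by part b of the theorem.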


We next show that $\sqrt{n}\,U_n$ converges in probability to zero as n increases, and
that $2nU_n/\log_2 e$ has a chi-square limiting distribution with k - 1 degrees of freedom.
Let us define

$$x_i = \frac{n_i - np_i}{np_i}, \qquad i = 1, \ldots, k.$$

Then

$$
U_n = \sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{np_i}
= \sum_{i=1}^{k} p_i (1 + x_i) \log_2 (1 + x_i), \tag{2}
$$

and since $n_i \ge 1$, it follows that $x_i \ge x_i^{*} = -1 + 1/(np_i) > -1$. Hence we can apply
Lemma A.2* and we obtain

$$
\frac{U_n}{\log_2 e} = \sum_{i=1}^{k} p_i \left[\frac{n_i - np_i}{np_i}
+ \sum_{v=2}^{j} \frac{(-1)^v}{v(v - 1)} \left(\frac{n_i - np_i}{np_i}\right)^v\right] + R_{j+1}, \tag{3}
$$

where


""^•^^-JiyxTn) 


"/'i 


"/'i 


j+1       i 


Pi       \ni  -  npi\J+^ 


ijXj+l)       (npiY 


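Furthermore, since

$$\sum_{i=1}^{k} (n_i - np_i) = 0,$$

we have

$$
\frac{U_n}{\log_2 e} = \frac{1}{n} \sum_{v=2}^{j} \frac{(-1)^v}{v(v - 1)}
\sum_{i=1}^{k} \frac{(n_i - np_i)^v}{(np_i)^{v-1}} + R_{j+1}, \tag{4}
$$

where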


*      |«,    -  «/7,|^+l 


y(y  +  1)  iti      (npiY-^ 

It follows from (2) that we do not need any special treatment of terms with
$n_i = 0$ in the approximations to $U_n$ yielded by (3), since the appearance of $n_i$ as a
multiplier will automatically cause the corresponding term of (2) to vanish when
$n_i = 0$. The elimination of terms involving $n_i = 0$ has made it possible for the
remainder terms to be bounded, for if we did not require $n_i \log_2 n_i$ to vanish when $n_i = 0$,
it would follow that there would be positive probability that H' would be
indeterminate.

Furthermore,  from  Lemma  B.l  and  Lemma  C.2  it  can  be  seen  that 


Pr(R';+,  >  e)=0 


1 


1 

,2;-2-j-l  /     ^         \  fjj-3 


and 


ER''      =  O,    .  , 


J  +  1 
,  where  rj  =  — - —     itj  is  odd 


=  -     ify  is  even. 
Actually,  it  is  easy  to  see  from  (4)  that  we  have 

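$$R'_{j+1} = \frac{(-1)^{j+1}}{(j+1)j\,n}\sum_{i=1}^{k}\frac{(n_i - np_i)^{j+1}}{(np_i)^{j}} + \frac{(-1)^{j+2}}{(j+2)(j+1)\,n}\sum_{i=1}^{k}\frac{(n_i - np_i)^{j+2}}{(np_i)^{j+1}} + R'_{j+3},$$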


*  The  letter  "A"  in   "Lemma   A.2"    indicates   that   this   lemma  will  be  found  in 
Appendix  A. 
and hence, symbolically,

^;^:=o(jzj)  +  o(i)  +  o(;l5)=o(;lj).  (5) 

Thus  Eq.  (5)  shows  that  the  upper  bound  of  0(l/«-'"^)  that  we  have  found  for  R'j^^  is 
unnecessarily  large,  but  the  above  device  is  sufficient  to  prove  that  R'^+i  and  ER'I^^ 
converge  to  0  as  fast  as 

i=i    (npiy 


and 


Now the first term of $2nU_n/\log_2 e$ is

$$\sum_{i=1}^{k}\frac{(n_i - np_i)^2}{np_i},$$

which is well known to have a chi-square limiting distribution with $k - 1$ degrees of
freedom, whereas all other terms of $2nU_n/\log_2 e$ converge in probability to zero by
Lemma C.2. Hence $2nU_n/\log_2 e$ has a limiting chi-square distribution. On the other
hand, since

$$\sqrt{n}\,U_n = \left(\frac{2nU_n}{\log_2 e}\right)\left(\frac{\log_2 e}{2\sqrt{n}}\right)$$

is the product of a random variable that has a limiting distribution by a variable that
converges to 0, it follows that $\sqrt{n}\,U_n$ converges in probability to 0.

Thus, if the $p_i$ are not all equal, $\sqrt{n}(H - H') = \sqrt{n}\,V_n + \sqrt{n}\,U_n$ is the sum of
two random variables, one of which has a normal limiting distribution, whereas the
other converges in probability to zero. Hence $\sqrt{n}(H - H')$ has the same limiting
distribution as $\sqrt{n}\,V_n$.

On the other hand, if the $p_i$ are all equal, then $V_n = 0$ and $[2n/(\log_2 e)](H - H')$
has the same limiting distribution as $[2n/(\log_2 e)]U_n$, namely, chi-square with $k - 1$
degrees of freedom.

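2. The Limiting First Moment of $H - H'$

By (1)

$$EH' = H - EU_n - EV_n.$$

Since $EV_n = 0$, it follows that $-EU_n$ is the bias of $H'$. In order to evaluate this bias
we now approximate $EU_n$.
From (4) we have

$$\frac{U_n}{\log_2 e} = \frac{1}{2n}\sum_{i=1}^{k}\frac{(n_i - np_i)^2}{np_i} - \frac{1}{6n}\sum_{i=1}^{k}\frac{(n_i - np_i)^3}{(np_i)^2} + \frac{1}{12n}\sum_{i=1}^{k}\frac{(n_i - np_i)^4}{(np_i)^3} - \frac{1}{20n}\sum_{i=1}^{k}\frac{(n_i - np_i)^5}{(np_i)^4} + \cdots.$$

From Lemma B.1 we see that

$$\begin{aligned}
\frac{EU_n}{\log_2 e} &= \frac{1}{2n}\sum_{i=1}^{k}\frac{np_iq_i}{np_i} - \frac{1}{6n}\sum_{i=1}^{k}\frac{np_iq_i(q_i - p_i)}{(np_i)^2} + \frac{1}{12n}\sum_{i=1}^{k}\frac{3n^2p_i^2q_i^2 + np_iq_i(1 - 6p_iq_i)}{(np_i)^3} \\
&\quad - \frac{1}{20n}\sum_{i=1}^{k}\frac{10n^2p_i^2q_i^2(q_i - p_i) + np_iq_i(q_i - p_i)(1 - 12p_iq_i)}{(np_i)^4} + O\left(\frac{1}{n^3}\right) \\
&= \frac{k - 1}{2n} - \frac{1}{6n^2}\sum_{i=1}^{k}\frac{q_i(q_i - p_i)}{p_i} + \frac{1}{4n^2}\sum_{i=1}^{k}\frac{q_i^2}{p_i} + O\left(\frac{1}{n^3}\right),
\end{aligned}$$

or, combining terms, we have

$$\begin{aligned}
\frac{EU_n}{\log_2 e} &= \frac{k - 1}{2n} + \frac{1}{12n^2}\sum_{i=1}^{k}\frac{1 - p_i^2}{p_i} + O\left(\frac{1}{n^3}\right) \\
&= \frac{k - 1}{2n} - \frac{1}{12n^2} + \frac{1}{12n^2}\sum_{i=1}^{k}\frac{1}{p_i} + O\left(\frac{1}{n^3}\right).
\end{aligned}$$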

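Hence, an estimate of $H$ that is unbiased to terms of order $1/n$ is $H' + (\log_2 e)(k - 1)/2n$,
and an estimate of $H$ that is unbiased to terms of order $1/n^2$ is

$$H' + (\log_2 e)\frac{k - 1}{2n} - \frac{\log_2 e}{12n^2} + \frac{\log_2 e}{12n^2}\sum_{i=1}^{k}\frac{1}{p_i}.$$

Thus, we have proved the following theorem:

Theorem 2.     Under the stated conditions

$$H - EH' = \log_2 e\left(\frac{k - 1}{2n} - \frac{1}{12n^2} + \frac{1}{12n^2}\sum_{i=1}^{k}\frac{1}{p_i}\right) + O\left(\frac{1}{n^3}\right).$$

Furthermore, if we let

$$H'' = H' + (\log_2 e)\frac{k - 1}{2n}$$

and let

$$H''' = H'' - \frac{\log_2 e}{12n^2} + \frac{\log_2 e}{12n^2}\sum_{i=1}^{k}\frac{1}{p_i},$$

then

$$H = EH' + O(1/n),$$

$$H = EH'' + O(1/n^2),$$

and

$$H = EH''' + O(1/n^3).$$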

Theorem 2 enables us to make several observations about the bias: (1) the term
of order $n^{-1}$, namely, $(k - 1)/2n$, does not depend on the probabilities $p_i$ and hence
$H''$ has a bias of lower order than $H'$ for all values of the $p_i$. (2) Since $(k - 1)/2n$ and
$[(\sum 1/p_i) - 1]/12n^2$ are both positive quantities, $H'$ is biased downward even to terms
of order $n^{-2}$ for all possible values of the $p_i$. (Terms of higher order may be negative,
of course, so that $EH''$ or $EH'''$ may be greater than $H$ for small values of $n$.) (3) An
TABLE 1

Expected Values of the Estimators H', H'', and H''' for the Binomial
Case when p_1 = 0.50 and when p_1 = 0.05 for Sample Sizes up to 20

                   When p_1 = 0.5              When p_1 = 0.05
Sample size     EH'     EH''    EH'''       EH'     EH''    EH'''
     1          0       .721    1.082       0       .721    3.132
     2          .500    .861    .951        .095    .456    1.058
     3          .689    .929    .969        .131    .371    .640
     4          .781    .961    .983        .153    .333    .484
     5          .832    .977    .990        .169    .313    .410
     6          .865    .985    .995        .181    .301    .368
     7          .887    .990    .997        .191    .294    .343
     8          .903    .993    .998        .199    .289    .327
     9          .914    .994    .999        .206    .286    .316
    10          .924    .996    .999        .212    .284    .308
    11          .931    .997    1.000       .217    .283    .302
    12          .937    .997    1.000       .222    .282    .298
    13          .942    .998    1.000       .226    .281    .296
    14          .947    .998    1.000       .229    .281    .293
    15          .951    .999    1.000       .232    .280    .291
    16          .954    .999    1.000       .235    .280    .289
    17          .957    .999    1.000       .238    .280    .288
    18          .959    .999    1.000       .240    .280    .287
    19          .961    .999    1.000       .242    .280    .287
    20          .963    .999    1.000       .244    .280    .286
     ∞          1.000   1.000   1.000       .286    .286    .286

increase in bias results if one uses $[(k - 1)/2n] - 1/12n^2$ as an overall correction and
omits $(\sum 1/p_i)/12n^2$. (4) When all the $p_i$ are equal, $H'''$ becomes

$$H' + \log_2 e\left(\frac{k - 1}{2n} + \frac{k^2 - 1}{12n^2}\right),$$

which is a lower bound for $H'''$ (that is to say, if the $p_i$ are unequal, $\sum 1/p_i > k^2$).
In order to illustrate the use of the bias corrections of Theorem 2 for a simple
case, we state the following:

Corollary.     For the binomial case, $k = 2$, we obtain the following estimates of
$H$ to terms of order $n^{-2}$:

If $k = 2$ and $p_1 = 0.5$, then

$$H''' = H' + (\log_2 e)\left(\frac{1}{2n} + \frac{1}{4n^2}\right).$$

If $k = 2$ and $p_1 = 0.05$, then

$$H''' = H' + (\log_2 e)\left(\frac{1}{2n} + \frac{381}{228n^2}\right).$$

In Table 1 the expected values of the estimates, $EH'$, $EH''$, and $EH'''$, are
compared with $H$ for the binomial case for sizes of sample up to $n = 20$. When
$p_1 = 0.5$, samples as small as 5 give satisfactory estimates of $H$, but when $p_1 = 0.05$, the
size of sample needed becomes larger.
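The entries of Table 1 can be recomputed exactly by summing over the binomial distribution. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, and the corrections use the true $p_i$ (as the tabulated values do), which would not be available when estimating from data.

```python
import math

LOG2E = math.log2(math.e)

def plug_in_entropy(counts, n):
    """H' in bits: entropy of the observed relative frequencies (0 log 0 taken as 0)."""
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def expected_estimates(p1, n):
    """Exact E[H'], E[H''], E[H'''] for a binomial sample of size n (k = 2)."""
    k = 2
    p = [p1, 1.0 - p1]
    eh1 = 0.0
    for n1 in range(n + 1):
        prob = math.comb(n, n1) * p1 ** n1 * (1.0 - p1) ** (n - n1)
        eh1 += prob * plug_in_entropy([n1, n - n1], n)
    # First-order correction (k - 1) / 2n, in bits.
    c1 = LOG2E * (k - 1) / (2 * n)
    # Second-order correction [(sum 1/p_i) - 1] / 12 n^2, in bits.
    c2 = LOG2E * (sum(1.0 / pi for pi in p) - 1.0) / (12 * n ** 2)
    return eh1, eh1 + c1, eh1 + c1 + c2

for n in (1, 2, 5, 20):
    print(n, [round(v, 3) for v in expected_estimates(0.5, n)],
          [round(v, 3) for v in expected_estimates(0.05, n)])
```

Because the corrections are deterministic here, $EH''$ and $EH'''$ differ from $EH'$ only by the added constants.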

3. The Limiting Second Moment of $H - H'$

Since $E(H - H')^2 = EV_n^2 + 2EU_nV_n + EU_n^2$, we will now consider each of
these three terms in order.

a. Evaluation of $EV_n^2$

We have already seen that

$$EV_n^2 = \frac{1}{n}\sum_{i=1}^{k} p_i(\log_2 p_i + H)^2. \tag{6}$$


b. Evaluation of $EU_nV_n$
By (4), we have


U  V    = 
where 


"1  4, 

-  Z("t  -  npr)\ogzPi 


log2  e  ^     (-1)"     ^ (tij  -  npiY 
J  L     «      .to  v{v  -  1)  >i   (npiY-^ 


+  J^U., 


1     ^ 

."  1  =  1 


10g2  g  p„ 


By an analysis such as that summarized in (5) we can ignore $ER'_{j+1}V_n$ as involving
terms that approach 0 more rapidly than the terms we shall retain. Hence


loga^ 


EU^V^ 


'       (-1)"    r  I  „(«z  -  "^r^'  ,  .   ^J"i-  "P^y("h  -  npn) 


(npiY 


logaA 


By  Lemma  B.3 


so  that 


Ph 

E[(nh  -  npf,)    ni]  = («;  -  npi) 


Now 


E(ni  -  npiYinf,  -  npn)  , 
^  =  2 7Ziy=i logs/'/* 


E(ni  -  npiY^^  Ph  , 

Z  7 ^1^=1 10g2/'A- 


so  that 


-^  Ph  logs /'ft  =  H  +pi  log2  Pi 

h 

^E(ni-npiY+^    ,    4,(;7,log2^^)£(«^--«;;,r+i 
J->  =  H   >  — ; r^irr h  Z  7       ^T^^i 
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logaA  + 


Pi  log  Pi       H 


+ 


I 

i  =  l 


10g2/7,:   +  /f 


(-ir    Ei,ni  -np,r 


(7) 


We  shall  want  to  retain  all  terms  of  (7)  in  order  (!/«)  or  lower.   From  Lemma 
B.  1  it  is  clear  that  we  need  consider  only  terms  to  /  =  4.  Hence,  we  begin  by  evaluating 

_  1  npiqiiqi  -  p,)  _  1  ^n^pW  +  «/^/^Xl  -  6/?,^,)        1    XOn^plq^qj  -  pd 
'~  2  npi  6  n^pl  12  «=>? 

where  we  omit  the  second  term  of  the  fifth  moment  of  «j  since  it  will  yield  a  term  of 
order  l/«^.   Then 


2     +4-(^^^-^^^+4^- 
By  substituting  in  (7)  we  obtain 

^3^  ^  -  ^  ,.2  P.  (log./.,  +  «)  +  g;;  I 

^4iaog.,,  +  ^)+|-i!2iii^i±^, 

since 


(8) 


2  A0og2/7i   +if)   =0. 

i  =  l 

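c. Evaluation of $EU_n^2$

The first three terms of the approximation to $U_n$ given by (4) will be used, namely,

$$\frac{nU_n}{\log_2 e} = \frac{1}{2}\sum_{i=1}^{k}\frac{(n_i - np_i)^2}{np_i} - \frac{1}{6}\sum_{i=1}^{k}\frac{(n_i - np_i)^3}{(np_i)^2} + \frac{1}{12}\sum_{i=1}^{k}\frac{(n_i - np_i)^4}{(np_i)^3}.$$

Since the details are very tedious, they have been put in Appendix D. Here we state
the result.

Theorem 3.     Including terms of order $1/n$ we have

$$E\left(\frac{nU_n}{\log_2 e}\right)^2 = \frac{k^2 - 1}{4} + \frac{7k - 11}{12n}\sum_{i=1}^{k}\frac{1}{p_i} - \frac{9k^2 - 20k + 7}{12n},$$

and if $k = 2$, $p_1 = 1/2$ we have

$$E\left(\frac{nU_n}{\log_2 e}\right)^2 = \frac{3}{4} + \frac{3}{4n}.$$
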

GEORGE   A.    MILLER   AND    WILLIAM   G.    MADOW  457 


Finally, from (6), (8), and Theorem 3, we obtain
Theorem 4.     In general


6" 


E{H  -  H'f  =\tpi  aog2/',  +  Hf  -  ^^i(log2/7,  +  H) 
^   21ogae|^log2/7,  +  H   ^   i\og,e)W  -  D 

(lop^eflk  -  114,    1        (log2e)2  /I 

but  if  all  the  Pi  are  equal,  then 

^      (log2e)2(A:2  -  1)       i\og,ef 
E{H  -  H'f  =      ^'      '2 +     .T  /  ("7^  -  ll)'^' 


(logs  e)^  /  1 


Furthermore, if $k = 2$ and $p_1 = 1/2$, then

$$E(H - H')^2 = \frac{3(\log_2 e)^2}{4n^2}\left(\frac{n + 1}{n}\right) + O\left(\frac{1}{n^4}\right).$$


In Theorem 4 we have approximated the mean square error of $H'$ about $H$.
For any random variable $H'$ we have

$$E(H - H')^2 = \sigma^2_{H'} + (EH' - H)^2$$

where $\sigma^2_{H'}$ is the variance of $H'$, i.e.

$$\sigma^2_{H'} = E(H' - EH')^2.$$

Since $(EH' - H)$ is given by Theorem 2, we can approximate $\sigma^2_{H'}$ by using

$$\sigma^2_{H'} = E(H - H')^2 - (\log_2 e)^2\left[\frac{(k - 1)^2}{4n^2} + \frac{k - 1}{12n^3}\left(\sum_{i=1}^{k}\frac{1}{p_i} - 1\right)\right] + O\left(\frac{1}{n^4}\right),$$

where the mean square error $E(H - H')^2$ will be obtained from Theorem 4. For
estimating $H$, the mean square error is the more fundamental quantity.

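By way of illustration, consider the binomial case where $k = 2$ and $p_1 = 0.5$.
Then we have the approximation

$$\sigma^2_{H'} = \frac{3(\log_2 e)^2}{4n^2}\left(\frac{n + 1}{n}\right) - \frac{(\log_2 e)^2}{4n^2}\left(\frac{n + 1}{n}\right) = \frac{(\log_2 e)^2}{2n^2}\left(\frac{n + 1}{n}\right).$$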

Appendix A.     An Expansion for $(1 + x)\log(1 + x)$

We begin with an expansion of $\log(1 + x)$ and then derive the expansion of
$(1 + x)\log(1 + x)$.

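Lemma A.1.     Let $-1 < x_0 \le x$. Then

$$\log(1 + x) = x - \frac{x^2}{2} + \cdots + (-1)^{j-1}\frac{x^j}{j} + R_{j+1}, \tag{A.1}$$

where

$$R_{j+1} = (-1)^j\int_0^x \frac{t^j}{1 + t}\,dt, \tag{A.2}$$

and hence, if $x_0 < 0$,

$$|R_{j+1}| \le \frac{|x|^{j+1}}{(1 + x_0)(j + 1)}, \tag{A.3}$$

while, if $x_0 \ge 0$,

$$|R_{j+1}| \le \frac{|x|^{j+1}}{j + 1}. \tag{A.4}$$

Proof.    If $-1 < x$, then

$$\log(1 + x) = \int_0^x \frac{dt}{1 + t}$$

and, if we expand $1/(1 + t)$, we obtain

$$\log(1 + x) = \sum_{i=1}^{j}(-1)^{i-1}\frac{x^i}{i} + (-1)^j\int_0^x \frac{t^j}{1 + t}\,dt.$$

Thus (A.1) and (A.2) hold. Then (A.3) follows from the fact that $1/(1 + t) \le 1/(1 + x_0)$
if $x_0 < 0$ and (A.4) follows from the fact that $1/(1 + t) \le 1$ if $x_0 \ge 0$.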

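Lemma A.2.     Let $-1 < x_0 \le x$. Then

$$(1 + x)\log(1 + x) = x + \sum_{\nu=2}^{j}\frac{(-1)^{\nu}x^{\nu}}{\nu(\nu - 1)} + R'_{j+1}, \tag{A.5}$$

where

$$R'_{j+1} = (-1)^{j+1}\int_0^x\int_0^t \frac{u^{j-1}}{1 + u}\,du\,dt, \tag{A.6}$$

and hence, if $x_0 < 0$, then

$$|R'_{j+1}| \le \frac{|x|^{j+1}}{(1 + x_0)\,j(j + 1)}, \tag{A.7}$$

while, if $x_0 \ge 0$, then

$$|R'_{j+1}| \le \frac{|x|^{j+1}}{j(j + 1)}. \tag{A.8}$$

Proof.    If $-1 < x$, then

$$\int_0^x \frac{1 + t}{1 + t}\,dt = x$$

and also, integrating by parts,

$$\int_0^x \frac{1 + t}{1 + t}\,dt = (1 + x)\log(1 + x) - \int_0^x \log(1 + t)\,dt,$$

so that

$$(1 + x)\log(1 + x) = x + \int_0^x \log(1 + t)\,dt.$$

From Lemma A.1, it follows that if $x > -1$, then

$$\int_0^x \log(1 + t)\,dt = \sum_{\nu=2}^{j}\frac{(-1)^{\nu}x^{\nu}}{\nu(\nu - 1)} + (-1)^{j+1}\int_0^x\int_0^t \frac{u^{j-1}}{1 + u}\,du\,dt,$$

and hence (A.5) and (A.6) hold. If $x_0 < 0$ then, from (A.3) it follows that

$$|R'_{j+1}| \le \left|\int_0^x \frac{|t|^{j}}{(1 + x_0)\,j}\,dt\right| = \frac{|x|^{j+1}}{(1 + x_0)\,j(j + 1)},$$

so that (A.7) is proved. Then (A.8) follows in a similar fashion from (A.4).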
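A short numerical check of the expansion may help fix the indexing. The Python sketch below assumes the partial sum of (A.5) runs from $\nu = 2$ to $j$ and that the remainder bounds have the form $|x|^{j+1}/[(1 + x_0)j(j + 1)]$ for $x_0 < 0$ and $|x|^{j+1}/[j(j + 1)]$ for $x_0 \ge 0$; the function names and sample values are illustrative only.

```python
import math

def partial_sum(x, j):
    """x + sum_{v=2}^{j} (-1)^v x^v / (v (v - 1)), the truncated expansion."""
    return x + sum((-1) ** v * x ** v / (v * (v - 1)) for v in range(2, j + 1))

def remainder_bound(x, j, x0):
    """Assumed bound on the remainder; requires -1 < x0 <= x."""
    scale = 1.0 / (1.0 + x0) if x0 < 0 else 1.0
    return scale * abs(x) ** (j + 1) / (j * (j + 1))

for x, x0 in ((-0.3, -0.3), (0.5, 0.0)):
    j = 5
    exact = (1 + x) * math.log(1 + x)   # natural logarithm, as in Appendix A
    err = abs(exact - partial_sum(x, j))
    print(x, err, remainder_bound(x, j, x0))
```

In both sample cases the truncation error stays below the assumed bound.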

Appendix B.     Multinomial Moments

Let an operation having $k$ possible outcomes be independently performed $n$
times and let $n_i$ be the number of occurrences of the $i$th of the possible outcomes in the
$n$ performances. Then,

$$\frac{n!}{n_1!\,n_2!\cdots n_k!}\;p_1^{n_1}p_2^{n_2}\cdots p_k^{n_k}$$

is the probability of obtaining any specified values of $n_1, \ldots, n_k$, where $n_1 + \cdots + n_k = n$,
$p_i \ge 0$, $p_1 + \cdots + p_k = 1$, and $p_i$ is the probability of the occurrence of the
$i$th of the possible outcomes in each operation, $i = 1, \ldots, k$.

Then, it is possible, by easy but tedious calculations, to prove the following
lemma.

Lemma B.1.     The first six moments of $n_i$ are given by the following equations.

$$\begin{aligned}
En_i &= np_i \\
E(n_i - np_i)^2 &= np_iq_i \qquad (\text{where } q_i = 1 - p_i) \\
E(n_i - np_i)^3 &= np_iq_i(q_i - p_i) \\
E(n_i - np_i)^4 &= 3n^2p_i^2q_i^2 + np_iq_i(1 - 6p_iq_i) \\
E(n_i - np_i)^5 &= 10n^2p_i^2q_i^2(q_i - p_i) + np_iq_i(q_i - p_i)(1 - 12p_iq_i) \\
E(n_i - np_i)^6 &= 15n^3p_i^3q_i^3 + 5n^2p_i^2q_i^2[5 - 26p_iq_i] + np_iq_i[1 - 30p_iq_i + 120p_i^2q_i^2].
\end{aligned}$$

In general, if $m$ is an integer, then

$$E(n_i - np_i)^{2m} = O(n^m)$$

and

$$E(n_i - np_i)^{2m+1} = O(n^m).$$

The proof of Lemma B.1 is omitted.
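Since the proof is omitted, the formulas of Lemma B.1 can at least be confirmed by direct summation for particular values. The Python sketch below does this in exact rational arithmetic; each $n_i$ is marginally binomial, and the helper names and the chosen $n$ and $p$ are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def central_moment(n, p, r):
    """Exact r-th central moment of a Binomial(n, p) count by direct summation."""
    q = 1 - p
    mean = n * p
    return sum(comb(n, x) * p ** x * q ** (n - x) * (x - mean) ** r
               for x in range(n + 1))

def lemma_b1(n, p, r):
    """Closed forms for the central moments of orders 2 to 6 as stated in Lemma B.1."""
    q = 1 - p
    pq = p * q
    formulas = {
        2: n * pq,
        3: n * pq * (q - p),
        4: 3 * n ** 2 * pq ** 2 + n * pq * (1 - 6 * pq),
        5: 10 * n ** 2 * pq ** 2 * (q - p) + n * pq * (q - p) * (1 - 12 * pq),
        6: (15 * n ** 3 * pq ** 3 + 5 * n ** 2 * pq ** 2 * (5 - 26 * pq)
            + n * pq * (1 - 30 * pq + 120 * pq ** 2)),
    }
    return formulas[r]

n, p = 7, Fraction(1, 5)   # hypothetical example values
for r in range(2, 7):
    print(r, central_moment(n, p, r) == lemma_b1(n, p, r))
```

Using `Fraction` avoids rounding, so the comparison is an exact equality rather than a tolerance check.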

We shall need not only the moments of $n_i$ about its mean but also certain of the
product moments

E(ni  -  npi^inj  -  npjf. 

The  following  lemma  will  be  helpful  in  deriving  these  moments.  Its  usefulness  results 
from  the  fact  that  the  needed  conditional  moments  will  be  obtainable  easily  from 
Lemma  B.l. 

Lemma B.2.     Let $x'$ be a random variable and let $A'$ be a random event. Then

$$E(x' - Ex')^{\nu} = \sum_{\alpha=0}^{\nu}\frac{\nu!}{\alpha!\,(\nu - \alpha)!}\,E\{[E(x' \mid A') - Ex']^{\nu-\alpha}\,E([x' - E(x' \mid A')]^{\alpha} \mid A')\}. \tag{B.1}$$

Proof.    In general, if $u$ is any random variable, then

$$Eu = E\{E(u \mid A')\} \tag{B.2}$$

where $A'$ is a random event and the "$\mid$" denotes conditional expectation. To apply the
general formula we put $u = (x' - Ex')^{\nu}$ and note that

$$(x' - Ex')^{\nu} = \sum_{\alpha=0}^{\nu}\frac{\nu!}{\alpha!\,(\nu - \alpha)!}\,[E(x' \mid A') - Ex']^{\nu-\alpha}[x' - E(x' \mid A')]^{\alpha}.$$

Then, since

$$E\{[E(x' \mid A') - Ex']^{\nu-\alpha} \mid A'\} = [E(x' \mid A') - Ex']^{\nu-\alpha},$$

the Lemma follows by substituting for $x' - Ex'$ in (B.1).

Since we will apply (B.1) for $\nu = 1, 2, 3$ in the following Lemma, we now write
out $E[(x' - Ex')^{\nu} \mid A']$ for $\nu = 1, 2, 3$.

$$E[(x' - Ex') \mid A'] = E(x' \mid A') - Ex' \tag{B.3}$$

$$E[(x' - Ex')^2 \mid A'] = [E(x' \mid A') - Ex']^2 + E\{[x' - E(x' \mid A')]^2 \mid A'\} \tag{B.4}$$

$$\begin{aligned}
E[(x' - Ex')^3 \mid A'] &= [E(x' \mid A') - Ex']^3 + 3[E(x' \mid A') - Ex']\,E\{[x' - E(x' \mid A')]^2 \mid A'\} \\
&\quad + E\{[x' - E(x' \mid A')]^3 \mid A'\}.
\end{aligned} \tag{B.5}$$

Let us now evaluate some joint moments for a multinomial distribution. In all cases,
the random event $A'$ will be "$n_i$ has a specified value."

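Lemma B.3.     We assume a multinomial population and suppose $i \neq j$. Then

$$E(n_j \mid n_i) = np_j - \frac{p_j}{q_i}(n_i - np_i),$$

$$E[(n_j - np_j) \mid n_i] = -\frac{p_j}{q_i}(n_i - np_i),$$

$$E[(n_j - np_j)^2 \mid n_i] = \frac{p_j^2}{q_i^2}(n_i - np_i)^2 - \frac{p_j(q_i - p_j)}{q_i^2}(n_i - np_i) + \frac{np_j(q_i - p_j)}{q_i},$$

$$\begin{aligned}
E[(n_j - np_j)^3 \mid n_i] &= -\frac{p_j^3}{q_i^3}(n_i - np_i)^3 + 3\frac{p_j^2(q_i - p_j)}{q_i^3}(n_i - np_i)^2 - 3n\frac{p_j^2(q_i - p_j)}{q_i^2}(n_i - np_i) \\
&\quad - \frac{p_j(q_i - p_j)(q_i - 2p_j)}{q_i^3}(n_i - np_i) + \frac{np_j(q_i - p_j)(q_i - 2p_j)}{q_i^2}.
\end{aligned}$$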

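Proof.     If $n_i$ is fixed the conditional size of sample is $n - n_i$ and the conditional
probability of the $j$th possible outcome is $p_j/q_i$. Hence

$$E(n_j \mid n_i) = (n - n_i)\frac{p_j}{q_i} = np_j - \frac{p_j}{q_i}(n_i - np_i),$$

$$E[(n_j - np_j) \mid n_i] = (n - n_i)\frac{p_j}{q_i} - np_j = -\frac{p_j}{q_i}(n_i - np_i).$$

Also

$$E(n_j \mid n_i) - En_j = -\frac{p_j}{q_i}(n_i - np_i).$$

Hence

$$\begin{aligned}
E[(n_j - np_j)^2 \mid n_i] &= \frac{p_j^2}{q_i^2}(n_i - np_i)^2 + (n - n_i)\frac{p_j(q_i - p_j)}{q_i^2} \\
&= \frac{p_j^2}{q_i^2}(n_i - np_i)^2 - \frac{p_j(q_i - p_j)}{q_i^2}(n_i - np_i) + \frac{np_j(q_i - p_j)}{q_i},
\end{aligned}$$

since $n - n_i = nq_i - (n_i - np_i)$. Finally,

$$\begin{aligned}
E[(n_j - np_j)^3 \mid n_i] &= -\frac{p_j^3}{q_i^3}(n_i - np_i)^3 - 3\frac{p_j}{q_i}(n_i - np_i)(n - n_i)\frac{p_j(q_i - p_j)}{q_i^2} + (n - n_i)\frac{p_j(q_i - p_j)(q_i - 2p_j)}{q_i^3} \\
&= -\frac{p_j^3}{q_i^3}(n_i - np_i)^3 + 3\frac{p_j^2(q_i - p_j)}{q_i^3}(n_i - np_i)^2 - 3n\frac{p_j^2(q_i - p_j)}{q_i^2}(n_i - np_i) \\
&\quad - \frac{p_j(q_i - p_j)(q_i - 2p_j)}{q_i^3}(n_i - np_i) + \frac{np_j(q_i - p_j)(q_i - 2p_j)}{q_i^2}.
\end{aligned}$$
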
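Hence

$$E(n_i - np_i)(n_j - np_j) = E\left[(n_i - np_i)\left(-\frac{p_j}{q_i}\right)(n_i - np_i)\right] = -\frac{p_j}{q_i}\,np_iq_i = -np_ip_j,$$

$$\begin{aligned}
E(n_i - np_i)^2(n_j - np_j)^2 &= E\left\{(n_i - np_i)^2\left[\frac{p_j^2}{q_i^2}(n_i - np_i)^2 - \frac{p_j(q_i - p_j)}{q_i^2}(n_i - np_i) + \frac{np_j(q_i - p_j)}{q_i}\right]\right\} \\
&= \frac{p_j^2}{q_i^2}\left[3n^2p_i^2q_i^2 + np_iq_i(1 - 6p_iq_i)\right] - \frac{p_j(q_i - p_j)}{q_i^2}\,np_iq_i(q_i - p_i) + \frac{np_j(q_i - p_j)}{q_i}\,np_iq_i \\
&= 3n^2p_i^2p_j^2 + \frac{np_ip_j^2(1 - 6p_iq_i)}{q_i} + n^2p_ip_j(q_i - p_j) - \frac{np_ip_j(q_i - p_i)(q_i - p_j)}{q_i},
\end{aligned}$$

$$\sum_{i \neq j}\frac{E(n_i - np_i)^2(n_j - np_j)^2}{n^2p_ip_j} = \sum_{i \neq j}\left[3p_ip_j + (q_i - p_j)\right] + \sum_{i \neq j}\left[\frac{p_j(1 - 6p_iq_i)}{nq_i} - \frac{(q_i - p_i)(q_i - p_j)}{nq_i}\right].$$

$$\sum_{j \neq i} p_j = 1 - p_i = q_i, \qquad \sum_{j \neq i}(q_i - p_j) = (k - 1)q_i - q_i = (k - 2)q_i.$$

Hence

$$\begin{aligned}
\sum_{i \neq j}\frac{E(n_i - np_i)^2(n_j - np_j)^2}{n^2p_ip_j} &= \sum_{i=1}^{k}\left[3p_iq_i + (k - 2)q_i + \frac{1 - 6p_iq_i - (k - 2)(q_i - p_i)}{n}\right] \\
&= 3\sum_{i=1}^{k} p_iq_i + (k - 1)(k - 2) + \frac{1}{n}\left[k - 6\sum_{i=1}^{k} p_iq_i - (k - 2)^2\right].
\end{aligned}$$
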

Appendix C.     Order of Convergence and Convergence in Probability

If

$$\lim_{n \to \infty} n^{\alpha}f(n)$$

is bounded, we say that $f(n)$ is at most of order $1/n^{\alpha}$ and write

$$f(n) = O(1/n^{\alpha}).$$

If

$$\lim_{n \to \infty} n^{\alpha}f(n) = 0,$$

we say that $f(n)$ is of lower order than $1/n^{\alpha}$ and write

$$f(n) = o(1/n^{\alpha}).$$

A sequence of random variables $u_1, u_2, \ldots$ converges in probability to 0 if,
for every $\epsilon > 0$, we have

$$\lim_{n \to \infty}\Pr(|u_n| > \epsilon) = 0.$$

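Lemma C.1.     If $2\alpha > \beta$, then $(n_i - np_i)^{\beta}/n^{\alpha}$ converges in probability to 0 as $n$
becomes infinite.

Proof.     Using some simple manipulations and the Tchebycheff inequality, we
have

$$\Pr\left\{\frac{|n_i - np_i|^{\beta}}{n^{\alpha}} > \epsilon\right\} = \Pr\left\{\left|\frac{n_i}{n} - p_i\right| > \epsilon^{1/\beta}n^{(\alpha/\beta)-1}\right\} \le \frac{p_iq_i}{\epsilon^{2/\beta}n^{(2\alpha/\beta)-1}},$$

so that convergence in probability occurs if

$$\frac{2\alpha}{\beta} - 1 > 0,$$

i.e., if $2\alpha > \beta$.

Lemma C.2.    If $2\alpha > \beta$, then

$$\sum_{i=1}^{k}\frac{|n_i - np_i|^{\beta}}{n^{\alpha}}$$

converges in probability to 0 as $n$ becomes infinite.

Proof.     Since

$$\Pr\left(\sum_{i=1}^{k}\frac{|n_i - np_i|^{\beta}}{n^{\alpha}} > \epsilon\right) \le \Pr\left(\text{at least one of } \frac{|n_i - np_i|^{\beta}}{n^{\alpha}} > \frac{\epsilon}{k}\right) \le \sum_{i=1}^{k}\frac{k^{2/\beta}\,p_iq_i}{\epsilon^{2/\beta}n^{(2\alpha/\beta)-1}},$$

the result follows if $2\alpha > \beta$.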
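Appendix D.     Evaluation of $EU_n^2$

The first three terms of the approximation to $U_n$ given by (4) will be the basis for
the approximation we use, i.e.,

$$\frac{nU_n}{\log_2 e} = \frac{1}{2}\sum_{i=1}^{k}\frac{(n_i - np_i)^2}{np_i} - \frac{1}{6}\sum_{i=1}^{k}\frac{(n_i - np_i)^3}{(np_i)^2} + \frac{1}{12}\sum_{i=1}^{k}\frac{(n_i - np_i)^4}{(np_i)^3}.$$

Let us define

$$w_{i\nu} = \frac{(n_i - np_i)^{\nu}}{(np_i)^{\nu-1}}.$$

Then, excluding terms that will yield moments $E(n_i - np_i)^j$ where $j \ge 7$, we have

$$\begin{aligned}
\left(\frac{nU_n}{\log_2 e}\right)^2 &= \frac{1}{4}\left(\sum_{i} w_{i2}^2 + \sum_{h \neq i} w_{h2}w_{i2}\right) - \frac{1}{6}\left(\sum_{i} w_{i2}w_{i3} + \sum_{h \neq i} w_{h2}w_{i3}\right) \\
&\quad + \frac{1}{36}\left(\sum_{i} w_{i3}^2 + \sum_{h \neq i} w_{h3}w_{i3}\right) + \frac{1}{12}\left(\sum_{i} w_{i2}w_{i4} + \sum_{h \neq i} w_{h2}w_{i4}\right).
\end{aligned}$$

We now evaluate the necessary expected values. Inasmuch as we wish to retain only
terms of $O(1/n)$ or lower we shall drop the terms of higher order as they appear
indicating their omission by "$\cdots$".

$$Ew_{i2}^2 = E\frac{(n_i - np_i)^4}{(np_i)^2} = \frac{3n^2p_i^2q_i^2 + np_iq_i(1 - 6p_iq_i)}{n^2p_i^2} = 3q_i^2 + \frac{q_i(1 - 6p_iq_i)}{np_i},$$

$$Ew_{i2}w_{i3} = E\frac{(n_i - np_i)^5}{(np_i)^3} = \frac{10n^2p_i^2q_i^2(q_i - p_i) + np_iq_i(q_i - p_i)(1 - 12p_iq_i)}{n^3p_i^3} = \frac{10q_i^2(q_i - p_i)}{np_i} + \frac{q_i(q_i - p_i)(1 - 12p_iq_i)}{n^2p_i^2},$$

$$\begin{aligned}
Ew_{i3}^2 = Ew_{i2}w_{i4} &= E\frac{(n_i - np_i)^6}{(np_i)^4} = \frac{15n^3p_i^3q_i^3 + 5n^2p_i^2q_i^2[5 - 26p_iq_i] + np_iq_i[1 - 30p_iq_i + 120p_i^2q_i^2]}{n^4p_i^4} \\
&= \frac{15q_i^3}{np_i} + \frac{5q_i^2(5 - 26p_iq_i)}{n^2p_i^2} + \frac{q_i(1 - 30p_iq_i + 120p_i^2q_i^2)}{n^3p_i^3}.
\end{aligned}$$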
Now since the first term of, say,


it  follows  that  in  obtaining  the  expected  values  of  the  terms  of  (6)  we  can  ignore  all 
terms  that  are  0(l/«^)  where  d  ^2.  We  shall  do  this  in  the  following  evaluations. 
From  Lemma  B.3  we  have 

E[{nh  -  npnf  \  ni\  =  -^  («i  -  np,Y ^ («j  -  np,)  + 


qh  qt  qi 


so  that 


nyiPhqt  nyipuqi 


+  2 2—  ^("i  -  "Pi) 


-T^  V>nYiq1  +  npiqlX  -  epiq,)-] 
P^Hi 


and  hence, 


— i— 2-  [«/'^•9^•(^^    "  /'.O]    +        „„  ^         «/'^^^ 


£^  Z    ^A2W'z2   =  3/?^^^•    H V  (K    -  Ijq^. 

h  n  n 


Also, 


+        ^3^   „2,  ^("t    -  "Pi)' 


'^  PiPhq  I  "  PiPh  qi 

nPhiRi  -Ph) 
n^^iPiPh 

_  iqi  -Pn)         2   2^2  +  .  .  .] 

+     „2„2^      npiqlqi  -  p>), 

where  "  ••  •  "  stands  for  quantities  that  will  yield  terms  Oiljn^)  or  higher.    Hence, 

I0p^(qi-pi)       3(qi-p^)       (qi  -  Pn)iqi  -  Pi) 
"■^   '^  n  n  npi 
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and 


lOqiiqi  -pi)       3ik  -  2)qi       (k  -  2)qi{qi  -  pd 

E  z  "^m^i^  = ;; + 

h  n  n  npi 


Also, 


Pl 


Phiqi  -Ph) 


Ew^^Wii  =    4  3^     o  E{ni  -  npif  -      .3       ,    E(ni  -  npd 


'^'''n^plPuq' 


nPiPnqi 


+  r^ Eirii  -  npiY 


,  „  „  [l5n-pU  +  •••]-  %:^  [■■■]+  %J;^  \ln-pU  +  •  •  •] 


Ph 


nyiqi 


f^pm 


ny-qi 


Hence, 


Finally, 


n  npi 


\5q1       3(k  -  l)q\ 

h  "  Wi 


Ew,,->Wi^  =  E 


=  E 


i^h   -  Whfi^i   -  Wi^^ 


n'pM 

(Hi  -  npif 
n'pip! 

qt 

3 

„4.  [..,=,?,?.  ■■■! 

^5phPi 

and 


First,  we  note  that 


and  that 


Ewf^  =  3qt  + 


npi 


1  -  6piqi  -{k  -2)  (qi  -  Pi) 
E  Z  ^m^i2  =  ^Piqi  +ik  -  2)qi  + : 

h 
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-   k 


k  k 

=  32  ^1  +  3(^  -  1)  -  32  ^1  +  (A:  -  2)(A:  -  1) 


1 

+  - 


2  -  -  62^1  +  ^  -  6(/c  -  1)  +  62  9l  -  (A:  -  2)^ 

;=1  ^i  i=l  i  =  l 


=  {k  -  \)ik  +  1)  + 


ir  *   1 

-    2 k  +k  -6{k  -\)  -{k  -If 

n\_i  =  iPi 


=  k^  -\  + 


n\_i=iP 
Similarly,  for  the  second  term, 


k     I 

2 k^  -2k  +2 


-  k 

2  ^12^13   +    2  ^m^iS 


k   r 


=  2 


1=1 


I0qi(qi-pi)       ^^qlqi-Pi)       Kk-2)qi    _   (k  -  2)qlqi  -  p,) 


npi 


+ 


+ 


npi 


_y    lOqiiqi  -pi)  -(k  -  2)qi{4pi  -  qj) 


^1  (,^B),|-(4.+2),.,,  ^1  ii,,^s,_3    (,^„, 

i=  1  "/'f  i=l    nPi 


For  the  third  term, 


2  M^fa  +  2  ^hz^i^ 
U'  =  1 


_  y  /15^  _  ^^ptq, 

iiAnpi  n 


=  2  -f-(^f  -/'D  =2  —  (^^  -p^)■ 

i  =  l  fiPi  i  =  l   ^/'i 


For  the  final  term, 


r  k 


U=i 


+  2 


^ii  ■  ^ii   +  Z^h2-  ^ii 


ijth 


=  1 


\5qf    ,    I5q!    ,    3(k  -  2)ql 


L  "Pi         « 


f^Pi 


k     0^2 


^-      ^^2 


=  2  ^'  [%  +  5a  +  Kk  -  2)]  =2  ^(3^  -  !)■ 

i  =  l  Wi  1  =  1  "/'i 




E 

logo  ej 

f 

k^ 

— 

1 

+ 

1 

= 

4 

4/7 

k  I 

y  —  k^  -2k  +  2 

ii=iPi 


(k  +8)  ^  ^ 
6«      1=1 /?i 


+  5 


A:  +2 


15 


15 


fC         rfi 


6« 


(/c  -  1)  +  —  2  ^  -  —  (^  - 1)  + 

36//  4  =  1  /7j        36/2 


3(3A:  -  1)  1^  ^ 
12//       j  =  i  /?,• 


yc2-l         1^1         2  -2k  ~k^       A:+8^1 

+  7-„  2  r  + 77 z-  I 


4  4«i'^i/?j 


4/2 


6«     i=i/7i 


A:  +  8  5(A:  +  2){k  -  1)         5 

6/2  6/2  1 2// 


+  _£_  l^^l    ,  3(3^-1)      ^1 

12/2iri/?i  12/2  ^/7/ 


Finally, 


/2t/„  \2     A:2  -  1         1    /  4^    1  V 
£L-^    =— ^  +Pr    I   -   (3  -2^-16+9^+2) 
logae/  4  12//\,fi/?  ' 


+  [6   -ek   -3k^   +  2^2   +   igyr.   ^  jQyr,2 

12/2 


+  lOA:  -  20  -  5A:  +  5  +  2  +  5A:  -  18A:] 
A:2  -  1       7^-11^    1        9A:2-20A:+7 

+  -^- — 2  r- 


12/2      i  =  iPi 


12/2 


As a check, $E(nU_n/\log_2 e)^2$ was computed for a special case. Let $k = 2$ so that

$$n_1 + n_2 = n, \qquad n_2 = n - n_1,$$

$$n_2 - np_2 = n - n_1 - np_2 = -(n_1 - np_1).$$


Hence, 


nU^    _    l(/2i  -np^fl  1    ^2 
log2  e  ^  2  /2  l/^i       ^1 


l(/2i    -/2/;i)7   1  1 

6         //^ 


Z'? 


1  (/2i  -/2/7i)V  1  _  r 

+  12  /23  l;,3       ^3 
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1  («!  -  np^'^        1   («i  -  np-^y 


4      nY^q\ 


+ 


+ 


iqi-pxf- 


1  (/7i   -  /7/?i) 


"Tl^l 


36      77*/7^^f      '^'      ^^'        6      "3-3 
1  («i  -  np^) 


j—(.qi  -pi) 


£.^!^f _...2„2.2 


12       «ViM 


4:j-(^i+/'i)- 


+ 


+ 


l5nY,ql{cjl+p'^ 
nnYq\ 


6nY^q\ 


^  3  ^  1  -  6/^1^1  ^  5{q^-p^f  _  5(q,  -  p,f  ^  5(ql  +  pD 

4  4/7/7i^i  12/7/7i^i  3«/7i^l  4«/7i^l 

3  1 

4  12«/7i^i 


[3  -  \Sp,q,  +  5iq,  - p,f  -  20{q^  - p^f  +  15(9^  +  p% 


Let/?!  =  ^1  =  1/2,  so  that 


^3        1  /        9       15 
log,e/  ^4  ^  3^V  ~  2  "^T 


3        1  /12  -  18  +  15 
""4  ^  3^ 


3  3 

4  An ' 


If  k  =  2  and  Pi  =  qi  =  1/2, 


logs  e 


3       1/         6 
4+«ll-4 
+  0 
+  0 
5 
4n 


(from  square  of  1st  term) 

(from  square  of  2nd  term) 
(from  product  of  1st  by  2nd) 

(from  product  of  1st  by  3rd). 


From  general  formula 
1st  term  squared 

2nd  term  squared 

product  of  1st  by  2nd 

product  of  1st  by  3rd  term 
Hence  each  part  checks. 


4+4^^4- 

1     15/    1/4 
36    ITr  1/2 


3        1 


15 


15 


15 


36n       36«       36/7 


=  0 


10  5x4 

6«  6n 

3x5/1       1\   _   15    _  5 
12«   \2  ^2/   ^\2n^4n 
A STATISTICAL DESCRIPTION OF VERBAL LEARNING*
George  A.  Miller  and  William  J.  McGill 

MASSACHUSETTS    INSTITUTE    OF   TECHNOLOGY 

Free-recall  verbal  learning  is  analyzed  in  terms  of  a  probability  model. 
The  general  theory  assumes  that  the  probability  of  recalling  a  word  on  any 
trial  is  completely  determined  by  the  number  of  times  the  word  has  been 
recalled  on  previous  trials.  Three  particular  cases  of  this  general  theory  are 
examined.  In  these  three  cases,  specific  restrictions  are  placed  upon  the 
relation  between  probability  of  recall  and  number  of  previous  recalls.  The 
application  of  these  special  cases  to  typical  experimental  data  is  illustrated. 
An  interpretation  of  the  model  in  terms  of  set  theory  is  suggested  but  is  not 
essential  to  the  argument. 

The  verbal  learning  considered  in  this  paper  is  the  kind  observed  in  the 
following  experiment:  A  list  of  words  is  presented  to  the  learner.  At  the 
end  of  the  presentation  he  writes  down  all  the  words  he  can  remember.  This 
procedure  is  repeated  through  a  series  of  n  trials.  At  the  present  time  we  are 
not  prepared  to  extend  the  statistical  theory  to  a  wider  range  of  experimental 
procedures. 

The  General  Model 

We shall assume that the degree to which any word in the test material
has been learned is completely specified by the number of times the word has
been recalled on preceding trials. In other words, the probability that a
word will be recalled on trial $n + 1$ is a function of $k$, the number of times
it has been recalled previously. (Symbols and their meanings are listed in
Appendix C at the end of the paper.)

Let the probability of recall after $k$ previous recalls be symbolized by
$\tau_k$. Then the corresponding probability of failing to recall the word is
$1 - \tau_k$. When a word has been recalled exactly $k$ times on the preceding
trials, we shall say that the word is in state $A_k$. Thus before the first trial
all the words are in state $A_0$; that is to say, they have been recalled zero
times on previous trials. Ideally, on the first trial a proportion $\tau_0$ of these
words is recalled and so passes from state $A_0$ to state $A_1$. The proportion
$1 - \tau_0$ is not recalled and so remains in state $A_0$. On the second trial the

*This research was facilitated by the authors' membership in the Inter-University
Summer  seminar  of  the  Social  Science  Research  Council,  entitled  Mathematical  Models 
for  Behavior  Theory,  held  at  Tufts  College,  June  28-August  24,  1951.  The  authors  are 
especially  grateful  to  Dr.  F.  Mosteller  for  advice  and  criticism  that  proved  helpful  on 
many  different  occasions. 

This  article  appeared  in  Psychometrika,  1952,  17,  369-396.    Reprinted  with  permission. 
words that remained in $A_0$ undergo the same transformation as before. Of
those in $A_1$, however, the proportion $1 - \tau_1$ is not recalled and so remains
in $A_1$.

One general problem is to determine the proportion of words expected
in state $A_k$ on trial $n$. Let $p(A_k, n)$ represent the probability that a word is
in state $A_k$ on trial $n$. Since these are probabilities, they must sum to unity
on any given trial:

$$\sum_{k} p(A_k, n) = 1.$$

The number of trials and the total number of times a word has been recalled
must assume non-negative, integral values. We assume that a word can be
recalled only once per trial at most, so the number of recalls cannot exceed
the number of trials. Therefore, we have

$$p(A_k, n) = 0 \qquad \text{for} \qquad k < 0,\; n < 0,\; n < k.$$

We also assume that none of the words can have been recalled before the
first trial, so for $n = 0$,

$$p(A_k, 0) = \begin{cases} 1 & \text{for } k = 0, \\ 0 & \text{for } k \neq 0. \end{cases}$$

For all trials we have the difference equation:

$$p(A_k, n + 1) = p(A_k, n)(1 - \tau_k) + p(A_{k-1}, n)\,\tau_{k-1}. \tag{1}$$

This equation reflects the fact that a word can get into state $A_k$ on trial
$n + 1$ in only two ways: (a) either it is in $A_k$ on trial $n$ and is not recalled
on trial $n + 1$, or (b) it is in $A_{k-1}$ on trial $n$ and is recalled on trial $n + 1$.
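The difference equation can be iterated directly from the initial condition that every word starts in state $A_0$. The Python sketch below is a minimal illustration; the function name and the example recall probabilities are hypothetical, and `tau[k]` stands for the recall probability after $k$ previous recalls.

```python
def state_probabilities(tau, n_trials):
    """Iterate equation (1); history[n][k] is p(A_k, n) for k = 0..n."""
    if len(tau) < n_trials:
        raise ValueError("need a recall probability for each of the states 0..n_trials-1")
    p = [1.0]                 # before the first trial every word is in A_0
    history = [p]
    for n in range(n_trials):
        nxt = [0.0] * (n + 2)
        for k, mass in enumerate(p):
            nxt[k] += mass * (1.0 - tau[k])   # (a) not recalled: stays in A_k
            nxt[k + 1] += mass * tau[k]       # (b) recalled: moves to A_{k+1}
        p = nxt
        history.append(p)
    return history

tau = [0.2, 0.5, 0.7]         # hypothetical recall probabilities
for n, row in enumerate(state_probabilities(tau, 3)):
    print(n, [round(v, 4) for v in row])
```

Each row sums to unity, as the normalization condition requires.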

The following rationalization for this scheme is in the spirit of the statistical
theories of learning developed by Bush and Mosteller (1) and by Estes
(3). The rationalization is not necessary for the development of the mathematics,
but it gives an alternative way of thinking about the present model
and helps to clarify its relation to the earlier theories. On the first presentation
of the list of words a random sample of stimulus elements is conditioned
to the appropriate response for each word. The measure of this
set of conditioned elements is $\tau_0$. (The total measure of the set of all stimulus
elements for a given word is assumed to be unity, so the measure can be
regarded as a probability.) If a word is not recalled, the measure of conditioned
elements for that word is unchanged. But if a word is recalled, the
proportion of conditioned elements is increased. The effect of recalling a
word is to take another random sample of elements from the total set and to
condition them. The proportion of elements conditioned when a word in
state $A_k$ is recalled is $\tau_{k+1} - \tau_k$. More precise interpretation of this
set-theoretical argument will be presented when we consider the special cases of
the general theory.

The general solution of (1) when all the $\tau_k$ are different is (see Appendix A):

$$p(A_0, n) = (1 - \tau_0)^n, \quad \text{for } k = 0,$$

$$p(A_k, n) = \tau_0 \tau_1 \cdots \tau_{k-1} \sum_{i=0}^{k} \frac{(1 - \tau_i)^n}{\prod_{j=0,\, j \neq i}^{k} (\tau_j - \tau_i)}, \quad \text{for } k > 0. \tag{2}$$

The denominator of each of the fractions in the summation includes all
differences of the form $(\tau_j - \tau_i)$ except for the zero difference $(\tau_i - \tau_i)$.
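
A numerical cross-check of the general solution against direct iteration of (1) may be helpful. The sketch below is illustrative only: it assumes the form of (2) with the product in the denominator taken over all $j \neq i$ from 0 to k, and the function names and test values are hypothetical.

```python
from math import prod


def closed_form(tau, k, n):
    """p(A_k, n) from the general solution (2); all tau[j] must differ."""
    if k == 0:
        return (1 - tau[0]) ** n
    lead = prod(tau[:k])  # tau_0 * tau_1 * ... * tau_{k-1}
    total = 0.0
    for i in range(k + 1):
        denom = prod(tau[j] - tau[i] for j in range(k + 1) if j != i)
        total += (1 - tau[i]) ** n / denom
    return lead * total


def by_recursion(tau, k, n):
    """p(A_k, n) obtained by iterating the difference equation (1)."""
    probs = [1.0] + [0.0] * k  # only states A_0 .. A_k need tracking
    for _ in range(n):
        new = [0.0] * (k + 1)
        for s, p in enumerate(probs):
            new[s] += p * (1 - tau[s])
            if s < k:
                new[s + 1] += p * tau[s]
        probs = new
    return probs[k]


if __name__ == "__main__":
    tau = [0.2, 0.35, 0.5, 0.6]
    print(closed_form(tau, 2, 5), by_recursion(tau, 2, 5))
```

The closed form loses precision when two of the $\tau_k$ are nearly equal, because the denominators then become small; the recursion does not have this difficulty.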

The  expected  number  of  times  a  word  is  recalled,  all  told,  up  to  and 
including  trial  n,  is,  by  definition, 

$$E(k, n) = \sum_{k=0}^{n} k\, p(A_k, n). \tag{3}$$

The expected proportion of words recalled on trial n + 1 is the difference,
$E(k, n+1) - E(k, n)$, between the cumulative values on successive trials.
This difference is the theoretical recall score and we symbolize it by $\rho_{n+1}$.
Thus we have the general relation

$$\rho_0 = 0, \quad \text{for } n = 0,$$

$$\rho_{n+1} = E(k, n+1) - E(k, n), \quad \text{for } n + 1 > 0. \tag{4}$$

An alternative expression for $\rho_{n+1}$ can be obtained as follows. On trial
n the probability that a word is in state $A_k$ is $p(A_k, n)$. The probability of
recall in state $A_k$ is $\tau_k$. The product $\tau_k \cdot p(A_k, n)$ is, therefore,
the probability that a word will both be in $A_k$ on trial n and also be recalled
on trial n + 1. If these joint probabilities are summed over all the states $A_k$
from k = 0 to k = n, we have the total probability that a word will be recalled
on trial n + 1. That is to say, we have $\rho_{n+1}$:

$$\rho_{n+1} = \sum_{k=0}^{n} \tau_k\, p(A_k, n). \tag{5}$$

The  two  expressions  (4)  and  (5)  are  equivalent,  which  can  be  shown  as 
follows.    From  (3)  and  (4)  together  we  have 

$$\rho_{n+1} = \sum_{k=0}^{n+1} k\, p(A_k, n+1) - \sum_{k=0}^{n} k\, p(A_k, n).$$

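The first summation on the right can be rewritten by substituting for
$p(A_k, n+1)$ according to (1):

$$\begin{aligned}
\sum_{k=0}^{n+1} k\, p(A_k, n+1) &= \sum_{k=0}^{n+1} k\, p(A_k, n)(1 - \tau_k) + \sum_{k=0}^{n+1} k\, p(A_{k-1}, n)\,\tau_{k-1} \\
&= \sum_{k=0}^{n} k\, p(A_k, n) - \sum_{k=0}^{n} k\, p(A_k, n)\,\tau_k + \sum_{k=0}^{n} (k+1)\, p(A_k, n)\,\tau_k.
\end{aligned}$$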


GEORGE  A.   MILLER  AND   WILLIAM  J.   MCGILL  473 

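When this result is substituted into the expression for $\rho_{n+1}$, we have

$$\begin{aligned}
\rho_{n+1} &= -\sum_{k=0}^{n} k\, p(A_k, n)\,\tau_k + \sum_{k=0}^{n} (k+1)\, p(A_k, n)\,\tau_k \\
&= \sum_{k=0}^{n} \tau_k\, p(A_k, n),
\end{aligned}$$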

which  is  the  desired  result. 
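
The equivalence of (4) and (5) can also be checked numerically. The following Python sketch is an illustration under assumed values; the helper names `step`, `expected_recalls`, and `recall_score_pairs` are hypothetical and do not come from the paper.

```python
def step(dist, tau):
    """Advance the state distribution by one trial using equation (1)."""
    nxt = [0.0] * (len(dist) + 1)
    for k, p in enumerate(dist):
        nxt[k] += p * (1.0 - tau[k])
        nxt[k + 1] += p * tau[k]
    return nxt


def expected_recalls(dist):
    """E(k, n) of equation (3) for a given state distribution."""
    return sum(k * p for k, p in enumerate(dist))


def recall_score_pairs(tau, n_max):
    """Return [(rho via (4), rho via (5))] for trials 1 .. n_max."""
    dist, pairs = [1.0], []
    for _ in range(n_max):
        nxt = step(dist, tau)
        via_4 = expected_recalls(nxt) - expected_recalls(dist)
        via_5 = sum(tau[k] * p for k, p in enumerate(dist))
        pairs.append((via_4, via_5))
        dist = nxt
    return pairs


if __name__ == "__main__":
    for a, b in recall_score_pairs([0.1, 0.3, 0.45, 0.6, 0.7, 0.8], 6):
        print(round(a, 6), round(b, 6))
```

The two columns agree to rounding error on every trial, as the derivation above requires.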

The asymptotic behavior of the model as n increases without limit can
be deduced from the general solution (2). First consider the case in which
one or more of the transitional probabilities $\tau_k$ is zero. All the words start
in state $A_0$ and have a positive probability of moving along to states $A_1$,
$A_2$, etc., up to the first state, $A_h$, with zero transitional probability, $\tau_h = 0$.
There the words are trapped; eventually all the words are recalled exactly
h times and cannot be recalled again. This fact can be seen from (2): If
$\tau_i > 0$, then all the terms $(1 - \tau_i)^n$ in (2) go to zero as $n \to \infty$. Thus
$p(A_k, n)$ goes to zero for k < h. For k > h, the product in front of the summation
must include $\tau_h = 0$, and so $p(A_k, n) = 0$ for k > h. When k = h, however,
$(1 - \tau_h)^n = (1 - 0)^n = 1$, and so this term in the summation of (2) does not
go to zero. Instead, when $\tau_h = 0$ and $\tau_i > 0$ for i < h,

$$\lim_{n \to \infty} p(A_h, n) = \frac{\tau_0 \tau_1 \cdots \tau_{h-1}}{(\tau_0 - \tau_h)(\tau_1 - \tau_h) \cdots (\tau_{h-1} - \tau_h)} = 1.$$

The recall score, $\rho_{n+1}$, then approaches zero as an asymptote; from (5),

$$\lim_{n \to \infty} \rho_{n+1} = \sum_{k=0}^{\infty} \tau_k \left[\lim_{n \to \infty} p(A_k, n)\right] = 0,$$

since the probability at the asymptote is concentrated at state $A_h$, and for
this state $\tau_h = 0$. This case is of little interest for an acquisition theory,
since the asymptote of the learning curve is at zero. Therefore, in what
follows, we shall be concerned only with the case in which all the $\tau_k$ are
different and greater than zero.

If all the transitional probabilities $\tau_k$ are greater than zero, then from
(2) we see that as n approaches infinity all the terms in the summation go
toward zero for all finite values of k. Consequently the sum of the $p(A_k, n)$
can be made as near zero as we please for any finite k by selecting a large
enough value of n. In the limit, therefore, the probability of any finite
number of recalls is zero. Since the sum of the $p(A_k, n)$ must equal unity,
almost all the probability comes to be concentrated in state $A_\infty$, and we have
for the limit when all $\tau_k > 0$,

$$p(A_\infty, \infty) = 1.$$

We are now able to show that a word in state $A_k$ has probability one of
moving to state $A_{k+1}$, if the learning process is continued indefinitely. This
happens because almost all words eventually reach state $A_\infty$. Thus we can
write, for the probability of leaving state $A_k$ on some trial,

$$\sum_{n=k}^{\infty} \tau_k\, p(A_k, n) = 1,$$

or,

$$\sum_{n=k}^{\infty} p(A_k, n) = \frac{1}{\tau_k} \quad \text{for } \tau_k > 0.$$

In all the cases we shall consider in this paper the value of $\tau_k$ will approach
an asymptote as $k \to \infty$. We are interested in placing the following
restrictions on the $\tau_k$:

$$\tau_k \neq \tau_j,$$

$$\tau_k > 0,$$

$$\lim_{k \to \infty} \tau_k = m \le 1.$$

The first two conditions insure that $p(A_k, n)$ goes toward zero for finite k
and large n. The third condition provides the asymptotic value of $\tau_k$ for
infinite k. In the summation for the limiting value of $\rho_{n+1}$, all terms are zero
out to infinity, and so we have

$$\lim_{n \to \infty} \rho_{n+1} = m\, p(A_\infty, \infty) = m. \tag{5'}$$

In other words, if we assume that m is the asymptotic value of $\tau_k$ as $k \to \infty$,
then m is also the asymptotic value of $\rho_{n+1}$ as $n \to \infty$.

In the special cases discussed below, a restriction is placed upon the
value of $\tau_k$ in the form of the linear difference equation,*

$$\tau_{k+1} = a + \alpha \tau_k, \tag{6}$$

where $0 < a < 1$ and $0 < \alpha \le 1 - a$. The limits for $\alpha$ have been chosen so
that $\tau_{k+1}$ is bounded between zero and one and, since we are interested in
acquisition, so that $\tau_{k+1} > \tau_k$.

Consider the following development of (5):

$$\rho_{n+2} = \sum_{k=0}^{n+1} \tau_k\, p(A_k, n+1).$$

*We have tried to observe the convention that parameters are represented by Greek
letters and statistical estimates are represented by Roman letters. In the case of a and
m, however, we have violated this convention in order to make our symbols coincide with
those used by other workers. The symbols m, a, $\alpha$, and p were originally proposed by
Bush and Mosteller.
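
Now substitute for $p(A_k, n+1)$ according to (1):

$$\begin{aligned}
\rho_{n+2} &= \sum_{k=0}^{n+1} \tau_k\, p(A_k, n)(1 - \tau_k) + \sum_{k=0}^{n+1} \tau_k\, p(A_{k-1}, n)\,\tau_{k-1} \\
&= \rho_{n+1} - \sum_{k=0}^{n} \tau_k^2\, p(A_k, n) + \sum_{k=0}^{n} \tau_{k+1} \tau_k\, p(A_k, n).
\end{aligned}$$

Next we substitute for $\tau_{k+1}$ according to (6):

$$\begin{aligned}
\rho_{n+2} &= \rho_{n+1} - \sum_{k=0}^{n} \tau_k^2\, p(A_k, n) + \sum_{k=0}^{n} (a + \alpha \tau_k)\,\tau_k\, p(A_k, n) \\
&= (1 + a)\rho_{n+1} - (1 - \alpha) \sum_{k=0}^{n} \tau_k^2\, p(A_k, n) \\
&= (1 + a)\rho_{n+1} - (1 - \alpha)\, E(\tau_k^2, n+1),
\end{aligned} \tag{7}$$

where $E(\tau_k^2, n+1)$ is the second raw moment of the $\tau_k$ (as $\rho_{n+1}$ is the
first raw moment) for trial n + 1.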

Restriction (6) brings the system into direct correspondence with a
special case of the theory developed by Bush and Mosteller. In their
terminology, an operator $Q_1$ is applied to the probability of response, p, to give
$a_1 + \alpha_1 p$ as the new probability whenever a trial is successful. A second
operator $Q_2$ is applied to give $a_2 + \alpha_2 p$ whenever a trial is unsuccessful. In
the present application of this more general theory, $Q_1$ is preserved intact by
restriction (6), but $Q_2$ is assumed to be the identity operator. That is to
say, $a_2$ is zero and $\alpha_2$ is unity, so $Q_2 p = p$. In the present application, an
unsuccessful trial consists of the omission of the word during recall. It
seems reasonable to assume that the non-occurrence of a word has no effect
upon its probability of occurrence on the next trial. How successful this
simple assumption is will be seen when we examine the data.

Analysis  of  the  Data 

At  the  end  of  the  experiment  the  experimenter  has  collected  a  set  of 
word  lists — the  words  recalled  by  the  learner  on  successive  trials.  These 
recall  lists  will  usually  contain  a  small  number  of  words  that  did  not  occur 
in  the  presentation.  These  spontaneous  additions  by  the  learner  are  of  some 
interest  in  themselves,  but  we  shall  ignore  them  in  the  present  discussion. 

We would like to use the data contained in the word lists to obtain an
estimate of $\rho_{n+1}$ in (5). We shall refer to the estimate as $r_{n+1}$. There are,
we suppose, N words provided by the experimenter as learning material in
the experiment. It seems reasonable to assume that under certain
conditions these words are homogeneous. By this we imply that the responses
to all of the words in state $A_k$ may be considered as estimates of the same
transitional probability of recall, $\tau_k$.

We can then define a convenient statistic,

$$r_{n+1} = \frac{1}{N} \sum_{k=0}^{n} \sum_{i=1}^{N} X_{i,k,n+1}. \tag{8}$$

The numbers, $X_{i,k,n+1}$, are either zero or one. The subscripts k and n + 1
have the same meaning that we have attached to them previously. They
indicate that we are looking at an event that occurs on trial n + 1 to a word
in state $A_k$. The first summation is carried out over i, the experimental
words, with k fixed to show that we count the number of words in each state.
The rules that determine whether an $X_{i,k,n+1}$ is zero or one are
straightforward. The $X_{i,k,n+1}$ are zero for all words not in state $A_k$ when summing
on i. They are zero for any word in state $A_k$, if a recall fails to occur on trial
n + 1. Lastly the $X_{i,k,n+1}$ are 1 for any word in state $A_k$, provided that a
recall occurs on trial n + 1. The second summation extends over k, the
various states. This summation goes only up to n because our reference
point for determining the number of states is trial n. These rules determine
$r_{n+1}$ as the proportion of correct responses to the N experimental words on
trial n + 1.

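To show that $r_{n+1}$ is unbiased we observe that

$$E(r_{n+1}) = \frac{1}{N} \sum_{k=0}^{n} \left[ E\left( \sum_{i=1}^{N} X_{i,k,n+1} \right) \right].$$

The expectation of any $X_{i,k,n+1}$ in state $A_k$ is $\tau_k$. Thus the expectation of
the sum in the brackets is $N \cdot \tau_k \cdot p(A_k, n)$. Substituting this into the
expression for $E(r_{n+1})$, we find

$$E(r_{n+1}) = \sum_{k=0}^{n} \tau_k\, p(A_k, n),$$

$$E(r_{n+1}) = \rho_{n+1}. \tag{9}$$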

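The sampling variance of $r_{n+1}$ around $\rho_{n+1}$ is determined by the variances
of the various $X_{i,k,n+1}$ around the transitional probabilities, $\tau_k$.

$$\operatorname{Var}(r_{n+1}) = \frac{1}{N^2} \sum_{k=0}^{n} \operatorname{Var}\left[ \sum_{i=1}^{N} X_{i,k,n+1} \right].$$

The variance of any $X_{i,k,n+1}$ in state $A_k$ is binomial and is given by $\tau_k(1 - \tau_k)$.
The variance of $\sum_{i=1}^{N} X_{i,k,n+1}$ thus becomes $N\, p(A_k, n)\,\tau_k (1 - \tau_k)$.
Substituting this into the expression for $\operatorname{Var}(r_{n+1})$, we obtain

$$\operatorname{Var}(r_{n+1}) = \frac{1}{N} \sum_{k=0}^{n} p(A_k, n)\,\tau_k (1 - \tau_k). \tag{10}$$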

It should be noted that this variance is never larger than the binomial variance

$$\frac{1}{N}\, \rho_{n+1} (1 - \rho_{n+1}),$$

since the binomial variance includes in addition to (10) a term that depends
on the variance of the $\tau_k$ around $\rho_{n+1}$,

$$\operatorname{Var}(r_{n+1}) = \frac{\rho_{n+1}(1 - \rho_{n+1})}{N} - \frac{1}{N} \sum_{k=0}^{n} p(A_k, n)\,(\tau_k - \rho_{n+1})^2. \tag{10'}$$


In order to apply the general theory we must obtain estimates of the
transitional probabilities, $\tau_k$. Now $\tau_k$ is the probability of moving from
state $A_k$ to $A_{k+1}$ and is assumed to be constant from trial to trial. After
trial n a certain number of words, $N_{k,n}$, are in state $A_k$. Of these $N_{k,n}$ words,
some go on to $A_{k+1}$ and some remain in $A_k$ on trial n + 1. The fraction that
moves up to $A_{k+1}$ provides an estimate of $\tau_k$ on that trial. Therefore, on
every trial we obtain an estimate of $\tau_k$. Call these estimates $t_{k,n+1}$. Then

$$t_{k,n+1} = \frac{1}{N_{k,n}} \sum_{i=1}^{N} X_{i,k,n+1}.$$

If $N_{k,n}$ is zero, no estimate is possible.

Next we wish to combine the $t_{k,n+1}$ to obtain a single estimate, $t_k$, of
the transitional probability, $\tau_k$. The least-squares solution, obtained by
minimizing $\sum_n (t_{k,n+1} - \tau_k)^2$, is the direct average of the $t_{k,n+1}$. This estimate
is unbiased, but it has too large a variance because it places undue emphasis
upon the $t_{k,n+1}$ that are based on small values of $N_{k,n}$. We prefer, therefore,
to use the maximum-likelihood estimate,

$$t_k = \frac{\sum_n N_{k,n}\, t_{k,n+1}}{\sum_n N_{k,n}},$$

which respects the accuracy of the various $t_{k,n+1}$.

For example, after trial 7 there may be 10 words in state $A_3$. Of these
10, 6 are recalled on trial 8. This gives the estimate $t_{3,8} = 6/10$. Every
trial on which $N_{3,n} \neq 0$ provides a similar estimate, $t_{3,n+1}$. The final estimate
of $\tau_3$ is obtained by weighting each of these separate estimates according to
the size of the sample on which it is based and then averaging. This
procedure is repeated for all the $\tau_k$ individually as far as the data permit.

The $t_{k,n+1}$ are also useful to check the basic assumption that $\tau_k$ is
independent of n. If the $t_{k,n+1}$ show a significant trend, this basic assumption
is violated.
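
The estimation procedure just described can be sketched in code. The Python example below is illustrative only: the layout of the input (one row of zeros and ones per word, one column per trial) and the name `estimate_tau` are assumptions, and the tiny data set is invented for the example rather than taken from any experiment.

```python
def estimate_tau(recalls):
    """Estimate transitional probabilities from recall records.

    recalls[i][t] is 1 if word i was recalled on trial t + 1, else 0.
    Returns (per_trial, pooled):
      per_trial[(k, n)] = fraction of the words in state A_k after trial n
                          that were recalled on trial n + 1;
      pooled[k]         = the per-trial estimates for state A_k, weighted by
                          the number of words on which each is based.
    """
    counts = {}  # (k, n) -> [words in A_k after trial n, recalled on n + 1]
    for word in recalls:
        k = 0  # number of recalls so far, i.e. the current state
        for n, x in enumerate(word):
            cell = counts.setdefault((k, n), [0, 0])
            cell[0] += 1
            cell[1] += x
            k += x
    per_trial = {key: hit / size for key, (size, hit) in counts.items()}
    totals = {}  # k -> [total words, total recalls], summed over trials
    for (k, _), (size, hit) in counts.items():
        tot = totals.setdefault(k, [0, 0])
        tot[0] += size
        tot[1] += hit
    pooled = {k: hit / size for k, (size, hit) in totals.items()}
    return per_trial, pooled


if __name__ == "__main__":
    toy = [[1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 1]]  # invented data
    per_trial, pooled = estimate_tau(toy)
    print(sorted(per_trial.items()))
    print(sorted(pooled.items()))
```

Weighting each per-trial fraction by its sample size and averaging is the same as dividing the total number of recalls from a state by the total number of opportunities in that state, which is how `pooled` is computed here.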

The  Simplest  Case:    One  Parameter 

The computation of $p(A_k, n)$ from (2) for the general case is exceedingly
tedious as n and k become moderately large. We look, therefore, for a simple
relation among the $\tau_k$ of the form of restriction (6). The first case that we
shall consider is

$$\tau_0 = a,$$

$$\tau_{k+1} = a + (1 - a)\tau_k. \tag{12}$$

In this form the model contains only the single parameter, a. The solution
of the difference equation (12) is

$$\tau_k = 1 - (1 - a)^{k+1}. \tag{13}$$


The interpretation of (13) in set-theoretical terms runs as follows: On
the first presentation of the list a random sample of elements is conditioned
for each word. The measure of this sample is a, and it represents the
probability, $\tau_0$, of going from state $A_0$ to state $A_1$. If a word is not recalled, no
change is produced in the proportion of conditioned elements. When a
word is recalled, however, the effect is to condition another random sample
of elements, drawn independently of the first sample, of measure a to that
word. Since some of the elements sampled at recall will have been previously
conditioned, after one recall we have (because of our assumption of
independence between successive samples):

$$\begin{pmatrix} \text{Elements conditioned} \\ \text{during presentation} \end{pmatrix} + \begin{pmatrix} \text{Elements conditioned} \\ \text{during the recall} \end{pmatrix} - \begin{pmatrix} \text{Common} \\ \text{elements} \end{pmatrix} = a + a - a^2 = 1 - (1 - a)^2.$$

This quantity gives us the transitional probability $\tau_1$ of going from $A_1$ to
$A_2$, from the first to the second recall. The second time a word is recalled
another independent random sample of measure a is drawn and conditioned,
so we have

$$\tau_2 = [1 - (1 - a)^2] + a - a[1 - (1 - a)^2] = 1 - (1 - a)^3.$$

Continuing in this way generates the relation (13).

With this substitution the general difference equation (1) becomes

$$p(A_k, n+1) = p(A_k, n)(1 - a)^{k+1} + p(A_{k-1}, n)[1 - (1 - a)^k].$$

The solution of this difference equation can be obtained by the general method
outlined in Appendix A or by the appropriate substitution for $\tau_k$ in (2).
The solution is

$$p(A_0, n) = (1 - a)^n,$$

$$p(A_k, n) = (1 - a)^{n-k} \prod_{i=0}^{k-1} [1 - (1 - a)^{n-i}]. \tag{14}$$

From definition (5) it is possible to obtain the following recursive
expression for the recall on trial n + 1 (see Appendix B):

$$\rho_{n+1} = a + (1 - a)[1 - (1 - a)^n]\rho_n. \tag{15}$$

The variance of the recall score, $r_{n+1}$, is

$$\operatorname{Var}(r_{n+1}) = \frac{1}{Na}(\rho_{n+2} - \rho_{n+1}). \tag{16}$$
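
As a check on the one-parameter formulas, the recursion (15) can be compared with the direct sum (5). The Python sketch below is an illustration, not the authors' computation; it assumes the forms $\tau_k = 1 - (1 - a)^{k+1}$ and $p(A_k, n) = (1 - a)^{n-k} \prod_{i=0}^{k-1} [1 - (1 - a)^{n-i}]$, and the function names are hypothetical. The value a = 0.22 is the one quoted in the text for the illustrative subject.

```python
def rho_direct(a, n):
    """rho_{n+1} as the sum over k of tau_k * p(A_k, n), one-parameter case."""
    q = 1 - a
    total = 0.0
    for k in range(n + 1):
        p = q ** (n - k)
        for i in range(k):
            p *= 1 - q ** (n - i)
        total += (1 - q ** (k + 1)) * p
    return total


def rho_recursive(a, n_max):
    """[rho_1, ..., rho_{n_max}] from the recursion (15), with rho_0 = 0."""
    q = 1 - a
    rho, out = 0.0, []
    for n in range(n_max):
        rho = a + q * (1 - q ** n) * rho
        out.append(rho)
    return out


if __name__ == "__main__":
    curve = rho_recursive(0.22, 10)
    for n, value in enumerate(curve):
        print(n + 1, round(value, 4), round(rho_direct(0.22, n), 4))
```

The recursion is much cheaper than the direct sum, which is why it is the convenient form for drawing a theoretical learning curve.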


In order to illustrate the application of these equations, we have taken
the data from one subject in an experiment by J. S. Bruner and C. Zimmerman
(unpublished). In their experiment a list of 64 monosyllabic English words
was read aloud to the subject. At the end of each reading the subject wrote
all of the words he could remember. The order of the words was scrambled
before each reading. A total of 32 presentations of the list was given.

From the detailed analysis of the estimates of $\tau_k$ derived from this
subject's data it was determined that a value of a = 0.22 would provide a
good fit. In Figure 1 the values of $\rho_{n+1}$ computed from (15) are given by the
solid function. The data are shown by the open circles. The dotted lines
are drawn ± one standard deviation from $\rho_{n+1}$ as computed from the variance
in (16). The single parameter gives a reasonably adequate description of
these data, at least through the first 20 trials. From the 20th trial on,
however, it seems that the subject "forgets as fast as he learns." He seems to
reach an asymptote somewhat below the theoretical value at unity. The
introduction of an asymptote less than unity will be discussed in connection
with the three-parameter case.

Figure 1

Comparison of Theoretical and Observed Values of $\rho_n$ for the One-Parameter Case. Dotted
line is drawn ± one standard deviation from $\rho_n$.


Figure 2

Comparison of Theoretical and Observed Values of $p(A_k, n)$ for the One-Parameter Case
(a = 0.22; abscissa: trial number)

As a further check on the correspondence of theory and data, Figure 2
shows the predicted and observed values of $p(A_k, n)$ as a function of n, for
k = 0, 1, 2, 3.

Second Case: Two Parameters

In the one-parameter form of the theory it is assumed that the
proportion of elements sampled during the presentation of the list is the same as
the proportion sampled during each recall. Most data are not adequately
described by such a simple model. At the very least, then, it is necessary to
consider the situation when these two sampling constants are different. In
order to introduce the second parameter, we phrase restriction (6) in the
following form:

$$\tau_0 = p_0,$$

$$\tau_{k+1} = a + (1 - a)\tau_k, \tag{17}$$

where $p_0$ is the proportion of elements conditioned during the presentation.
The solution of this difference equation can be written

$$\tau_k = 1 - (1 - p_0)(1 - a)^k. \tag{18}$$

On the first presentation of the list a random sample of measure $p_0$ is
conditioned to every word. When a word is recalled, a random sample of
measure a is drawn and conditioned. After one recall, therefore, the measure
of conditioned elements is

$$\tau_1 = p_0 + a - a p_0 = 1 - (1 - p_0)(1 - a).$$

After two recalls the measure of conditioned elements is

$$\begin{aligned}
\tau_2 &= [1 - (1 - p_0)(1 - a)] + a - a[1 - (1 - p_0)(1 - a)] \\
&= 1 - (1 - p_0)(1 - a)^2.
\end{aligned}$$

Continuing in this way generates the relation (18).

With this substitution the general difference equation (1) becomes

$$p(A_k, n+1) = p(A_k, n)(1 - p_0)(1 - a)^k + p(A_{k-1}, n)[1 - (1 - p_0)(1 - a)^{k-1}]. \tag{19}$$

The solution of (19) is

$$p(A_0, n) = (1 - p_0)^n,$$

$$p(A_k, n) = (1 - p_0)^{n-k} \prod_{i=0}^{k-1} \frac{[1 - (1 - p_0)(1 - a)^i][1 - (1 - a)^{n-i}]}{1 - (1 - a)^{i+1}}. \tag{20}$$

When $p_0 = a$, (20) reduces to (14).

The recursive form for the recall now becomes (see Appendix B)

$$\rho_{n+1} = p_0 + (1 - p_0)[1 - (1 - a)^n]\rho_n. \tag{21}$$

The variance of $r_{n+1}$ is

$$\operatorname{Var}(r_{n+1}) = \frac{1}{Na}(\rho_{n+2} - \rho_{n+1}). \tag{22}$$
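
The two-parameter formulas can be checked against one another in the same way as in the one-parameter case. The sketch below is illustrative; it assumes the product form of (20) with denominator $1 - (1 - a)^{i+1}$ and the recursion (21), and the function names are hypothetical. The values a = 0.10 and $p_0$ = 0.27 are those quoted in the text for the first set of data.

```python
def p_two_param(p0, a, k, n):
    """p(A_k, n) from the closed form (20)."""
    if k > n:
        return 0.0
    c, q = 1 - p0, 1 - a
    value = c ** (n - k)
    for i in range(k):
        value *= (1 - c * q ** i) * (1 - q ** (n - i)) / (1 - q ** (i + 1))
    return value


def rho_two_param(p0, a, n_max):
    """[rho_1, ..., rho_{n_max}] from the recursion (21), with rho_0 = 0."""
    q = 1 - a
    rho, out = 0.0, []
    for n in range(n_max):
        rho = p0 + (1 - p0) * (1 - q ** n) * rho
        out.append(rho)
    return out


def rho_from_states(p0, a, n):
    """rho_{n+1} as the sum over k of tau_k * p(A_k, n), with tau_k from (18)."""
    c, q = 1 - p0, 1 - a
    return sum((1 - c * q ** k) * p_two_param(p0, a, k, n) for k in range(n + 1))


if __name__ == "__main__":
    curve = rho_two_param(0.27, 0.10, 8)
    for n, value in enumerate(curve):
        print(n + 1, round(value, 4), round(rho_from_states(0.27, 0.10, n), 4))
```

Setting `p0` equal to `a` reproduces the one-parameter curve, in line with the remark that (20) reduces to (14).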

In order to illustrate the application of these equations we have selected
two sets of data. The first set was collected by Bruner and Zimmerman. A
list of 32 monosyllabic words was read aloud. At the end of each reading the
subject wrote all of the words he could remember. The order of the words
was scrambled before every reading. A total of 32 presentations of the list
was given.

From the analysis of the $t_k$ calculated for this particular subject it was
found that a = 0.10 and $p_0$ = 0.27 gave a good description of the data. In
Figure 3 the values of $\rho_{n+1}$ computed from (21) are shown by the solid
function. The data are given by the open circles. The dotted lines are drawn
± one standard deviation from $\rho_{n+1}$ as computed from (22). As a further
check, Figure 4 shows the predicted and observed values of $p(A_k, n)$ as a
function of n for k = 0, 1, 2, 3.

The  distribution  of  cumulative  recalls  on  any  given  trial  provides  still 
another  way  of  viewing  the  data.  In  Figure  5,  the  cumulative  distribution 
of  k,  the  number  of  recalls,  is  shown  for  trials  5,  10,  15,  20.  The  proportion 
of  test  words  recalled  k  times  or  less  is  plotted  for  comparison  on  each  trial. 

The  second  set  of  data  was  collected  by  M.  Levine.  He  read  aloud  a 
100-word  anecdote.  At  the  end  of  the  reading,  the  subject  wrote  down  all 
he  could  remember.  Four  such  trials  were  given.  The  order  of  the  words 
was  not  scrambled  during  the  interval  between  trials. 

From the analysis of the data for this particular subject it was found
that a = 0.87 and $p_0$ = 0.61 gave a good description of the results. Figure 6
shows the comparison of theory and experiment both for $\rho_{n+1}$ and for $p(A_k, n)$
for k = 0, 1, 2.

As  a  general  observation,  we  have  noted  that  when  the  order  of  the  words 
is  not  scrambled  between  trials,  the  parameter  a  is  relatively  large.  This 
is  to  say,  when  the  words  are  not  scrambled,  there  is  a  much  higher  probability 
that  the  same  words  will  be  recalled  on  successive  trials.  This  effect  is  related 
to  the  serial-position  curve.  The  subject  recalls  words  at  the  beginning  and 
at  the  end  of  the  list.  If  these  words  remain  in  their  favored  positions,  they 
continue  to  be  recalled.  New  words  are  added  to  those  recalled  at  the  ends 
at a rate determined by $p_0$, so the learning works from the two ends toward
the  middle,  which  is  the  last  to  be  learned.  This  effect  has  been  noted  with 
lists  of  randomly  selected  English  words  as  well  as  with  anecdotes. 

Third  Case:    Three  Parameters 

In the one- and two-parameter cases we have assumed that after sufficient
practice the subject should eventually reach perfect performance. Some data,
however, seem to evade this simple assumption and so it is necessary to
consider what happens when a lower asymptote is introduced. Such a parameter
may be necessary when, for example, the period of time allowed for recall is
limited.

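To introduce the third parameter we adopt the general restriction (6)

$$\tau_0 = p_0,$$

$$\tau_{k+1} = a + \alpha \tau_k, \quad \text{where } 0 < \alpha \le 1 - a < 1. \tag{23}$$

The solution of (23) can be written

$$\tau_k = \frac{a}{1 - \alpha} - \left( \frac{a}{1 - \alpha} - p_0 \right) \alpha^k. \tag{24}$$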

When $\alpha = 1 - a$, (24) reduces to (18). From (24) we see that as k increases
without limit, $\tau_k$ approaches $a/(1 - \alpha)$ as an asymptote. From (5') we know
that $\tau_k$ and $\rho_{n+1}$ approach the same asymptotic value, m. So we have the
equation

$$\lim_{n \to \infty} \rho_{n+1} = m = \frac{a}{1 - \alpha}. \tag{25}$$

Since $1 - \alpha \ge a$, m cannot exceed unity; and since both a > 0 and $1 - \alpha > 0$,
m cannot be negative. In general, we are interested in cases where
$m > p_0$, for if $p_0 > m$, we obtain forgetting rather than acquisition.

Figure 3

Comparison of Theoretical and Observed Values of $\rho_n$ for a Two-Parameter Case. Dotted
line is drawn ± one standard deviation from $\rho_n$.

Figure 4

Comparison of Theoretical and Observed Values of $p(A_k, n)$ for a Two-Parameter Case

Figure 5

Comparison of Theoretical and Observed Distribution of Recalls on Four Different Trials in
a Two-Parameter Case (abscissa: cumulative number of recalls, k)

Figure 6

Comparison of Theoretical and Observed Values of $\rho_n$ and $p(A_k, n)$ for a Two-Parameter
Case

A set-theoretical rationalization for (24) runs as follows. On the
presentation of the material a random sample of elements of measure $p_0$ is
conditioned for every word. At the first recall a sample of measure $1 - \alpha$
is drawn. Of these elements, a portion of measure a is conditioned and the
remainder, $1 - \alpha - a$, are extinguished. We add the conditioned elements
as before, but now we must subtract the measure of the elements conditioned
during presentation and extinguished during recall, i.e., $(1 - \alpha - a)p_0$. Thus
we have

$$\begin{aligned}
\tau_1 &= p_0 + a - a p_0 - (1 - \alpha - a)p_0 \\
&= m - (m - p_0)\alpha.
\end{aligned}$$

At the second recall the same sampling procedure is repeated:

$$\begin{aligned}
\tau_2 &= \tau_1 + a - a\tau_1 - (1 - \alpha - a)\tau_1 \\
&= a + \alpha\tau_1 = m - (m - p_0)\alpha^2.
\end{aligned}$$

Continuing in this way generates the relation (24).

When (24) is substituted into (1), we obtain the appropriate difference
equation, but its solution for the three-parameter case is hardly less
cumbersome than (2). It would appear that the simplest way to work with these
equations is to take advantage of our solution of the two-parameter case.

First, we introduce a new transitional probability, $\tau'_k$, such that

$$\begin{aligned}
\tau'_k &= \tau_k / m \\
&= 1 - (1 - p_0/m)\alpha^k, \quad \text{for } p_0 < m.
\end{aligned} \tag{26}$$

This new variable is now the same as in the case of two parameters given in
(18), with substitution of $p_0/m$ for $p_0$ and $\alpha$ for $(1 - a)$. Therefore, from
(2) and (20), we know that


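$$\begin{aligned}
\tau'_0 \tau'_1 \cdots \tau'_{k-1} \sum_{i=0}^{k} \frac{(1 - \tau'_i)^n}{\prod_{j=0,\, j \neq i}^{k} (\tau'_j - \tau'_i)}
&= \left(1 - \frac{p_0}{m}\right)^{n-k} \prod_{i=0}^{k-1} \frac{[1 - (1 - p_0/m)\alpha^i][1 - \alpha^{n-i}]}{1 - \alpha^{i+1}} \\
&= p'(A_k, n).
\end{aligned} \tag{27}$$

When $m\tau'_i$ is substituted for $\tau_i$ in (2), the factor $m^k$ in the product in front
of the summation cancels the factor $m^k$ in the denominator under the
summation. Thus we know that

$$p(A_k, n) = \tau'_0 \tau'_1 \cdots \tau'_{k-1} \sum_{i=0}^{k} \frac{(1 - \tau_i)^n}{\prod_{j=0,\, j \neq i}^{k} (\tau'_j - \tau'_i)}, \tag{28}$$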
which is the same as $p'(A_k, n)$ in (27) except for the numerator under the
summation. This numerator can be written

$$\begin{aligned}
(1 - \tau_i)^n &= [(1 - m) + m(1 - \tau'_i)]^n \\
&= (1 - m)^n + n(1 - m)^{n-1} m (1 - \tau'_i) \\
&\quad + \binom{n}{2}(1 - m)^{n-2} m^2 (1 - \tau'_i)^2 + \cdots + m^n (1 - \tau'_i)^n.
\end{aligned} \tag{29}$$

Now we substitute this sequence for the numerator in (28) and sum term by
term. When we consider the last term of this sequence we have

$$\tau'_0 \tau'_1 \cdots \tau'_{k-1} \sum_{i=0}^{k} \frac{m^n (1 - \tau'_i)^n}{\prod_{j=0,\, j \neq i}^{k} (\tau'_j - \tau'_i)},$$

which we know from (27) is equal to $m^n p'(A_k, n)$. The next to last term
gives

$$\tau'_0 \tau'_1 \cdots \tau'_{k-1} \sum_{i=0}^{k} \frac{n(1 - m) m^{n-1} (1 - \tau'_i)^{n-1}}{\prod_{j=0,\, j \neq i}^{k} (\tau'_j - \tau'_i)},$$

which we know from (27) is equal to $n(1 - m) m^{n-1} p'(A_k, n-1)$.
Proceeding in this manner brings us eventually to the case where n < k, and then
we know the term is zero. Consequently, we can write

$$\begin{aligned}
p(A_k, n) &= m^n p'(A_k, n) + n(1 - m) m^{n-1} p'(A_k, n-1) + \cdots \\
&\quad + \binom{n}{n-k}(1 - m)^{n-k} m^k p'(A_k, k) \\
&= \sum_{i=k}^{n} \binom{n}{i} m^i (1 - m)^{n-i} p'(A_k, i).
\end{aligned} \tag{30}$$

When the asymptote is unity (m = 1), (29) and (30) reduce to the
two-parameter case.

We recall that because of the way in which our probabilities were
defined in (1), (30) can be written as

$$p(A_k, n) = \sum_{i=0}^{n} \binom{n}{i} m^i (1 - m)^{n-i} p'(A_k, i).$$

Now it is not difficult to find an expression for $\rho_{n+1}$ in terms of the $\rho'_{i+1}$ computed
in the two-parameter case:

$$\begin{aligned}
\rho_{n+1} &= \sum_{k=0}^{n} \tau_k\, p(A_k, n) \\
&= m \sum_{k=0}^{n} \tau'_k\, p(A_k, n) \\
&= m \sum_{k=0}^{n} \sum_{i=0}^{n} \binom{n}{i} m^i (1 - m)^{n-i} \tau'_k\, p'(A_k, i).
\end{aligned}$$

If we invert the order of summation, we find that

$$\begin{aligned}
\rho_{n+1} &= m \sum_{i=0}^{n} \binom{n}{i} m^i (1 - m)^{n-i} \sum_{k=0}^{i} \tau'_k\, p'(A_k, i) \\
&= m \sum_{i=0}^{n} \binom{n}{i} m^i (1 - m)^{n-i} \rho'_{i+1}.
\end{aligned} \tag{31}$$

The computation of $\rho_{n+1}$ by this method involves two steps: first, the values
of $\rho'_{i+1}$ are calculated as in the two-parameter case with the substitution
indicated in (26); second, these values of $\rho'_{i+1}$ are weighted by the binomial
expansion of $[m + (1 - m)]^n$ and then summed according to (31).
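
The two-step computation can be sketched as follows. This Python example is illustrative only: it assumes the two-parameter recursion $\rho'_{i+1} = p'_0 + (1 - p'_0)(1 - \alpha^i)\rho'_i$ with $p'_0 = p_0/m$, compares the binomial mixture (31) with direct iteration of (1) under (24), and uses hypothetical function names. The parameter values are an assumption chosen to be plausible for a three-parameter fit (m = 0.7, so a = 0.119 with $\alpha$ = 0.83 and $p_0$ = 0.13).

```python
from math import comb


def rho_mixture(p0, a, alpha, n_max):
    """[rho_1, ..., rho_{n_max}] by the two-step method of (31)."""
    m = a / (1 - alpha)
    p0_prime = p0 / m  # substitution p0/m for p0; alpha replaces (1 - a)
    rho_prime, value = [], 0.0
    for i in range(n_max):  # step 1: two-parameter scores rho'_{i+1}
        value = p0_prime + (1 - p0_prime) * (1 - alpha ** i) * value
        rho_prime.append(value)
    return [  # step 2: binomial weighting
        m * sum(comb(n, i) * m ** i * (1 - m) ** (n - i) * rho_prime[i]
                for i in range(n + 1))
        for n in range(n_max)
    ]


def rho_direct(p0, a, alpha, n_max):
    """[rho_1, ..., rho_{n_max}] by iterating (1) with tau_k from (24)."""
    m = a / (1 - alpha)
    tau = [m - (m - p0) * alpha ** k for k in range(n_max)]
    dist, out = [1.0], []
    for _ in range(n_max):
        out.append(sum(t * p for t, p in zip(tau, dist)))
        nxt = [0.0] * (len(dist) + 1)
        for k, p in enumerate(dist):
            nxt[k] += p * (1 - tau[k])
            nxt[k + 1] += p * tau[k]
        dist = nxt
    return out


if __name__ == "__main__":
    mixed = rho_mixture(0.13, 0.119, 0.83, 12)
    direct = rho_direct(0.13, 0.119, 0.83, 12)
    for n, (x, y) in enumerate(zip(mixed, direct)):
        print(n + 1, round(x, 5), round(y, 5))
```

The agreement of the two columns reflects the interpretation of (30): on each trial the two-parameter process advances with probability m and stands still with probability 1 - m.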

These computations can be abbreviated somewhat by using an
approximation developed by Bush and Mosteller (personal communication). It is

$$\begin{aligned}
\rho_{n+2} &= (2 + a + 2a\alpha)\rho_{n+1} - [a^2(1 - \alpha) + (1 + a)(1 + 2a\alpha)]\rho_n \\
&\quad + 3(1 - \alpha^2)(1 + a)\rho_n^2 - 2(1 - \alpha)(1 - \alpha^2)\rho_n^3 - 3(1 - \alpha^2)\rho_n \rho_{n+1}, \quad (n \ge 1).
\end{aligned} \tag{32}$$

The approximation involves permitting the third moment of the distribution
of the $\tau_k$ around $\rho_n$ to go to zero on every trial.

The variance of $r_{n+1}$ in the three-parameter case is

$$\operatorname{Var}(r_{n+1}) = \frac{1}{N(1-\alpha)}\left[p_{n+2} - (a+\alpha)\,p_{n+1}\right]. \tag{33}$$

This expression for the variance of $r_{n+1}$ follows directly from (7) and (10'). It is easily seen that (10') can be written as follows:

$$\sum_{k=0}^{n} \tau_k^2\, p(A_k, n) = p_{n+1} - N \operatorname{Var}(r_{n+1}). \tag{34}$$

Substituting (34) in (7) and solving for $\operatorname{Var}(r_{n+1})$ we find that

$$\operatorname{Var}(r_{n+1}) = \frac{1}{N(1-\alpha)}\left[p_{n+2} - (a+\alpha)\,p_{n+1}\right],$$

which, except for notation, is (33). The one-parameter and two-parameter variances (16) and (22) are special cases of this expression.
It is of interest to observe that when the limiting value, $m$, is substituted in (33) for $p_{n+2}$ and $p_{n+1}$, the limiting variance is found to be binomial. That is,

$$\lim_{n \to \infty} \operatorname{Var}(r_{n+1}) = \frac{m(1-m)}{N}.$$

This reflects the fact, established earlier in (5'), that as $n$ grows very large the variance of the $\tau_k$ around $m$ goes to zero.
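The substitution can be written out in one line. The step below is a sketch that uses only (33) and the relation $a = m(1-\alpha)$ quoted in the numerical example that follows:

```latex
\lim_{n\to\infty}\operatorname{Var}(r_{n+1})
  = \frac{1}{N(1-\alpha)}\bigl[m-(a+\alpha)\,m\bigr]
  = \frac{m\,(1-\alpha-a)}{N(1-\alpha)}
  = \frac{m\,(1-\alpha)(1-m)}{N(1-\alpha)}
  = \frac{m(1-m)}{N}.
```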

In order to obtain a numerical example, we have taken the data from another subject in the experiment by Bruner and Zimmerman. Sixty-four monosyllabic English words were read aloud and the order of the words was scrambled before every presentation. A visual inspection of the data led us to choose an asymptote in the neighborhood of 0.7. This asymptote is drawn on the plot of the $t_k$ in Figure 7 and on the plot of the $r_n$ in Figure 8. Then we estimated $p_0 = 0.13$ by considering all the trials on which words were in state $A_0$ and calculating $p_0$ as the weighted average of the $t_{0,n+1}$ for all those trials. Next we estimated the sampling parameter $\alpha = 0.83$. This was done by obtaining the estimates, $t_k$, for successive values of $k$; these estimates, together with (24), give us a set of equations estimating $\alpha$. We used the weighted average of these estimates (ignoring negative values). Then we obtained $a = 0.12$ from the equation $a = m(1-\alpha)$. We shall comment on the estimation problems later.

Figure 7
Transitional Probability of Recall, $\tau_k$, as a Function of Number of Recalls in the Three-Parameter Case. Values of $t_k$ are indicated by open circles. The curve fitted to the $t_k$ is $\tau_k = 0.7 - 0.57(0.83)^k$. (Horizontal axis: cumulative number of recalls, $k$.)
Figure 8
Comparison of Theoretical and Observed Values of $p_n$ for Three-Parameter Case

Figure 9
Comparison of Theoretical and Observed Values of $p(A_k, n)$ for Three-Parameter Case
When these parameter values were substituted into (24) we obtained the function for $\tau_k$ shown in Figure 7. When the values were substituted into (28) for $k = 1, 2, 3, 4$, we obtained the functions for $p(A_k, n)$ shown in Figure 9. When they were substituted into (31) we obtained the function for $p_n$ shown in Figure 8. In Figure 8 the dotted lines are drawn $\pm$ one standard deviation from $p_n$, as computed from (33).

A comparison of the values of $p_n$ computed from (31) and from (32) is given for the first eighteen trials in Table 1. With this choice of parameters the Bush-Mosteller approximation seems highly satisfactory.

TABLE 1
Comparison of Exact and Approximate Values of $p_n$ for First 18 Trials

Trial   Exact   Approximate      Trial   Exact   Approximate
  1     .1300     .1300            10    .2663     .2655
  2     .1426     .1426            11    .2837     .2827
  3     .1559     .1559            12    .3014     .3000
  4     .1700     .1700            13    .3191     .3174
  5     .1847     .1846            14    .3369     .3347
  6     .2000     .1999            15    .3546     .3520
  7     .2159     .2157            16    .3722     .3692
  8     .2323     .2319            17    .3896     .3862
  9     .2491     .2486            18    .4067     .4030
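The comparison in Table 1 can be reproduced approximately with a short script. The sketch below is not the authors' computing procedure: it steps the state probabilities directly instead of using (31), takes $\tau_k = 0.7 - 0.57(0.83)^k$ from Figure 7, uses the unrounded value $a = m(1-\alpha) = 0.119$, and reads (32) with the coefficient $3(1-\alpha^2)(1+a)$ on $p_n^2$, which is the form obtained by setting the third central moment to zero. The function names are hypothetical.

```python
def exact_curve(p0, m, alpha, n_trials):
    """Exact p_1..p_n from the state probabilities p(A_k, n)."""
    d = [1.0]                                   # every word starts in A_0
    curve = []
    for _ in range(n_trials):
        taus = [m - (m - p0) * alpha**k for k in range(len(d))]
        curve.append(sum(t * dk for t, dk in zip(taus, d)))
        nxt = [0.0] * (len(d) + 1)
        for k, dk in enumerate(d):
            nxt[k] += (1.0 - taus[k]) * dk      # no recall: remain in A_k
            nxt[k + 1] += taus[k] * dk          # recall: advance to A_{k+1}
        d = nxt
    return curve

def approx_curve(p0, m, alpha, n_trials):
    """Approximate p_1..p_n from the second-order recursion (32)."""
    a = m * (1.0 - alpha)
    c = 1.0 - alpha**2
    # p_1 and p_2 are exact; the recursion supplies p_3 onward.
    p = [p0, (1.0 + a) * p0 - (1.0 - alpha) * p0**2]
    while len(p) < n_trials:
        pn, pn1 = p[-2], p[-1]
        p.append((2 + a + 2 * a * alpha) * pn1
                 - (a * a * (1 - alpha) + (1 + a) * (1 + 2 * a * alpha)) * pn
                 + 3 * c * (1 + a) * pn**2
                 - 2 * (1 - alpha) * c * pn**3
                 - 3 * c * pn * pn1)
    return p[:n_trials]

exact = exact_curve(0.13, 0.7, 0.83, 18)
approx = approx_curve(0.13, 0.7, 0.83, 18)
for trial, (e, x) in enumerate(zip(exact, approx), start=1):
    print(f"{trial:2d}  {e:.4f}  {x:.4f}")
```

Because every word starts in $A_0$, the distribution of the $\tau_k$ is a single point on the first trial, so the approximation is exact for $p_3$ and only begins to drift afterwards, which matches the pattern in the table.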

Discussion 

In  the  preceding  pages  we  have  made  the  explicit  assumption  that  the 
several  words  being  memorized  simultaneously  are  independent,  that  memor- 
izing one  word  does  not  affect  the  probability  of  recalling  another  word  on  the 
list.  The  assumption  can  be  justified  only  by  its  mathematical  convenience, 
because  the  data  uniformly  contradict  it.  The  learner's  introspective  report 
is  that  groups  of  words  go  together  to  form  associated  clusters,  and  this 
impression  is  supported  in  the  data  by  the  fact  that  many  pairs  of  words 
are  recalled  together  or  omitted  together  on  successive  trials.  If  the  theory 
is  used  to  describe  the  behavior  of  50  rats,  independence  is  a  reasonable 
assumption.  But  when  the  theory  describes  the  behavior  of  50  words  in  a 
list  that  a  single  subject  must  learn,  independence  is  not  a  reasonable  as- 
sumption. It  is  important,  therefore,  to  examine  the  consequences  of  intro- 
ducing covariance. 

The difference between the independent and the dependent versions of the theory can best be illustrated in terms of the set-theoretical interpretation of the two-parameter case. Imagine that we have a large ledger with 1000 pages. The presentation of the list is equivalent to writing each of the words at random on 100 pages. Thus $p_0 = 100/1000 = 0.1$. Now we select a page at random. On this page we find written the words A, B, and C. These are responses on the first trial. The rule is that each of these words must be written on 50 pages selected at random. Thus $a = 50/1000 = 0.05$. With the independent model we would first select 50 pages at random and make sure that word A was written on all of them, then select 50 more pages independently for B, and 50 more for C. With a dependent model, however, we could simply make one selection of 50 pages at random and write all three words, A, B, and C, on the same sample of 50 pages. Then whenever A was recalled again it would be likely that B and C would also be recalled at the same time.

The probability that a word will be recalled depends upon the measure of the elements conditioned to it (the number of pages in the ledger on which it is inscribed) and does not depend upon what other words are written on the same pages. Therefore, the introduction of covariance in this way does not change the theoretical recall, $p_{n+1}$. The only effect is to increase the variance of the estimates of $p_{n+1}$. In other words, it is not surprising that the equations give a fair description of the recall scores even though no attention was paid to the probabilities of joint occurrences of pairs of words. Associative clustering should affect the variability, not the rate, of memorization.
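The ledger picture lends itself to a small simulation. The sketch below is an illustration under assumptions that go beyond the text: a list of 20 words, 15 trials, 400 simulated subjects, and a fixed random seed, all chosen arbitrarily. In the dependent variant every word recalled on a trial is written on one shared sample of 50 pages; in the independent variant each recalled word gets its own sample.

```python
import random
from statistics import mean, pvariance

def run_subject(dependent, rng, n_words=20, n_pages=1000,
                initial=100, extra=50, n_trials=15):
    """Return the fraction of words recalled on each trial for one subject."""
    # Presentation of the list: each word is written on `initial` random pages.
    pages = [set(rng.sample(range(n_pages), initial)) for _ in range(n_words)]
    fractions = []
    for _ in range(n_trials):
        page = rng.randrange(n_pages)               # open the ledger at random
        recalled = [w for w in range(n_words) if page in pages[w]]
        fractions.append(len(recalled) / n_words)
        shared = rng.sample(range(n_pages), extra)  # used only when dependent
        for w in recalled:
            sample = shared if dependent else rng.sample(range(n_pages), extra)
            pages[w].update(sample)                 # write the word on 50 pages
    return fractions

def summarize(dependent, n_subjects=400, seed=1):
    """Mean and variance (across subjects) of the recall fraction, per trial."""
    rng = random.Random(seed)
    runs = [run_subject(dependent, rng) for _ in range(n_subjects)]
    by_trial = list(zip(*runs))
    return [mean(t) for t in by_trial], [pvariance(t) for t in by_trial]

mean_ind, var_ind = summarize(dependent=False)
mean_dep, var_dep = summarize(dependent=True)
```

If the argument in the text holds, the two mean curves should agree within sampling error, while the variances for the dependent variant should tend to be larger on later trials.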

The parameters $a$, $p_0$, and $\alpha$ obtained from the linear difference equation (6) are assumed to describe each word in the list. Thus data from different words may be combined to estimate the various $\tau_k$. If the parameters vary from word to word, $p_{n+1}$ is only an approximation of the mean probability of recall determined by averaging the recall probabilities of all the words. Similarly, the expressions given for $p_{n+1}$ cannot be expected to describe the result of averaging several subjects' data together unless all subjects are known to have the same values of the parameters.

The general theory, of course, is not limited to linear restrictions of the form of (6). The data or the theory may force us to consider more complicated functions for $\tau_k$. For all such cases the general solution (2) is applicable, though tedious to use, and will enable us to compute the necessary values of $p(A_k, n)$.

Once  a  descriptive  model  of  this  sort  has  been  used  to  tease  out  the 
necessary  parameters,  the  next  step  is  to  vary  the  experimental  conditions 
and  to  observe  the  effects  upon  these  parameters.  In  order  to  take  this  next 
step,  however,  we  need  efficient  methods  of  estimating  the  parameters  from 
the  data.  As  yet  we  have  found  no  satisfactory  answers  to  the  estimation 
problem. 

There is a sizeable amount of computation involved in determining the functions $p(A_k, n)$ and $p_n$. If a poor choice of the parameters $a$, $p_0$, and $\alpha$ is made at the outset, it takes several hours to discover the fact. In the example in the preceding section, we estimated the parameters successively and used different parts of the data for the different estimates. After $p_n$ had been computed it seemed to us that our estimates of $p_0$ and $m$ were both too low. Clearly, the method we have used to fit the theory to the data is not a particularly good one. We have considered least squares in order to use all of the data to estimate all parameters simultaneously. We convinced ourselves that the problem was beyond our abilities. Consequently, we must leave the estimation problem with the pious hope that it will appeal to someone with the mathematical competence to solve it.

Appendix  A 
Solution  for  p(Ak  ,  n)  in  the  General  Case 

The  solution  of  equation  (1)  with  the  boundary  conditions  we  have 
enumerated  has  been  obtained  several  times  in  the  past  (4,  5).  We  present 
below  our  own  method  of  solution  because  the  procedures  involved  may  be 
of  interest  in  other  applications. 

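Equation (1) may be written explicitly as follows:

$$
\begin{aligned}
(1-\tau_0)\,p(A_0, n) &= p(A_0, n+1) \\
\tau_0\,p(A_0, n) + (1-\tau_1)\,p(A_1, n) &= p(A_1, n+1) \\
\tau_1\,p(A_1, n) + (1-\tau_2)\,p(A_2, n) &= p(A_2, n+1) \\
&\;\;\vdots
\end{aligned}
$$

This system of equations can be written in matrix notation as follows:

$$
\begin{bmatrix}
1-\tau_0 & 0 & 0 & 0 & \cdots \\
\tau_0 & 1-\tau_1 & 0 & 0 & \cdots \\
0 & \tau_1 & 1-\tau_2 & 0 & \cdots \\
0 & 0 & \tau_2 & 1-\tau_3 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix}
p(A_0, n) \\ p(A_1, n) \\ p(A_2, n) \\ p(A_3, n) \\ \vdots
\end{bmatrix}
=
\begin{bmatrix}
p(A_0, n+1) \\ p(A_1, n+1) \\ p(A_2, n+1) \\ p(A_3, n+1) \\ \vdots
\end{bmatrix}
$$

This infinite matrix of transitional probabilities we shall call $T$, and the infinite column vectors made up of the state probabilities on trial $n$ and $n+1$ we shall call $d_n$ and $d_{n+1}$. So we can write

$$T d_n = d_{n+1}.$$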

The initial distribution of state probabilities, $d_0$, is the infinite column vector $\{1, 0, 0, 0, \cdots\}$. The state probabilities on trial one are then given by

$$T d_0 = d_1.$$

The state probabilities on trial two are given by

$$T d_1 = d_2,$$

so by substitution,

$$T d_1 = T(T d_0) = T^2 d_0 = d_2.$$

Continuing this procedure gives the general relation

$$T^n d_0 = d_n.$$

Therefore, the problem of determining $d_n$ can be equated to the problem of determining $T^n$.

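Since $T$ is a semi-matrix, we know that it can be expressed as

$$T = S D S^{-1},$$

where $D$ is an infinite diagonal matrix with the same elements on its diagonal as are on the main diagonal of $T$ (e.g., 2). The diagonal elements of $S$ are arbitrary, so we let $S_{ii} = 1$. Now we can write

$$TS = SD,$$

$$
\begin{bmatrix}
1-\tau_0 & 0 & 0 & \cdots \\
\tau_0 & 1-\tau_1 & 0 & \cdots \\
0 & \tau_1 & 1-\tau_2 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & \cdots \\
S_{21} & 1 & 0 & \cdots \\
S_{31} & S_{32} & 1 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & \cdots \\
S_{21} & 1 & 0 & \cdots \\
S_{31} & S_{32} & 1 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix}
1-\tau_0 & 0 & 0 & \cdots \\
0 & 1-\tau_1 & 0 & \cdots \\
0 & 0 & 1-\tau_2 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}
$$

Now it is a simple matter to solve for $S_{ij}$ term by term. For example, to solve for $S_{21}$ we construct (from row 2 and column 1) the equation

$$\tau_0 + (1-\tau_1)\,S_{21} = S_{21}(1-\tau_0),$$

which gives

$$S_{21} = \tau_0/(\tau_1 - \tau_0).$$

To solve for $S_{31}$, we use the equation

$$\tau_1 S_{21} + (1-\tau_2)\,S_{31} = S_{31}(1-\tau_0),$$

$$S_{31} = \tau_1 S_{21}/(\tau_2 - \tau_0) = \tau_0\tau_1/[(\tau_1 - \tau_0)(\tau_2 - \tau_0)].$$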
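Proceeding in this manner gives the necessary elements of $S$, and we have

$$
S =
\begin{bmatrix}
1 & 0 & 0 & 0 & \cdots \\[4pt]
\dfrac{\tau_0}{(\tau_1-\tau_0)} & 1 & 0 & 0 & \cdots \\[8pt]
\dfrac{\tau_0\tau_1}{(\tau_1-\tau_0)(\tau_2-\tau_0)} & \dfrac{\tau_1}{(\tau_2-\tau_1)} & 1 & 0 & \cdots \\[8pt]
\dfrac{\tau_0\tau_1\tau_2}{(\tau_1-\tau_0)(\tau_2-\tau_0)(\tau_3-\tau_0)} & \dfrac{\tau_1\tau_2}{(\tau_2-\tau_1)(\tau_3-\tau_1)} & \dfrac{\tau_2}{(\tau_3-\tau_2)} & 1 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}
$$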


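The elements of $S^{-1}$ can be obtained term by term from the equation $S S^{-1} = I$. For example, the element $S'_{21}$ of $S^{-1}$ is given by row two of $S$ times column one of $S^{-1}$: $\tau_0/(\tau_1 - \tau_0) + S'_{21} = 0$. Continuing in this way we have

$$
S^{-1} =
\begin{bmatrix}
1 & 0 & 0 & 0 & \cdots \\[4pt]
\dfrac{\tau_0}{(\tau_0-\tau_1)} & 1 & 0 & 0 & \cdots \\[8pt]
\dfrac{\tau_0\tau_1}{(\tau_0-\tau_2)(\tau_1-\tau_2)} & \dfrac{\tau_1}{(\tau_1-\tau_2)} & 1 & 0 & \cdots \\[8pt]
\dfrac{\tau_0\tau_1\tau_2}{(\tau_0-\tau_3)(\tau_1-\tau_3)(\tau_2-\tau_3)} & \dfrac{\tau_1\tau_2}{(\tau_1-\tau_3)(\tau_2-\tau_3)} & \dfrac{\tau_2}{(\tau_2-\tau_3)} & 1 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}
$$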


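These matrices permit a simple representation of the powers of the matrix $T$. Thus,

$$T^2 = (SDS^{-1})(SDS^{-1}) = SD(S^{-1}S)DS^{-1} = SD^2S^{-1},$$

and in general,

$$T^n = S D^n S^{-1}.$$

Since $D$ is a diagonal matrix, $D^n$ is obtained by taking the $n$th power of every diagonal element. When this equation for $T^n$ is multiplied through, we obtain

$$
T^n =
\begin{bmatrix}
(1-\tau_0)^n & 0 & 0 & \cdots \\[6pt]
\tau_0\left[\dfrac{(1-\tau_0)^n}{(\tau_1-\tau_0)} + \dfrac{(1-\tau_1)^n}{(\tau_0-\tau_1)}\right] & (1-\tau_1)^n & 0 & \cdots \\[12pt]
\tau_0\tau_1\left[\dfrac{(1-\tau_0)^n}{(\tau_1-\tau_0)(\tau_2-\tau_0)} + \dfrac{(1-\tau_1)^n}{(\tau_0-\tau_1)(\tau_2-\tau_1)} + \dfrac{(1-\tau_2)^n}{(\tau_0-\tau_2)(\tau_1-\tau_2)}\right] & \tau_1\left[\dfrac{(1-\tau_1)^n}{(\tau_2-\tau_1)} + \dfrac{(1-\tau_2)^n}{(\tau_1-\tau_2)}\right] & (1-\tau_2)^n & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{bmatrix}
$$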


Since $T^n d_0$ involves only the first column of $T^n$, it is not actually necessary to obtain more than the first columns of $S^{-1}$ and of $T^n$. We have presented the complete solution here, however. It can be seen from inspection of the first column of $T^n$ that (2) is the general solution:

$$
\begin{aligned}
p(A_0, n) &= (1-\tau_0)^n, && \text{for } k = 0, \\
p(A_k, n) &= \tau_0\tau_1\cdots\tau_{k-1} \sum_{i=0}^{k} \frac{(1-\tau_i)^n}{\prod_{j=0,\; j \ne i}^{k} (\tau_j - \tau_i)}, && \text{for } k > 0.
\end{aligned}
\tag{2}
$$

This general method of solution can be used for the special cases considered in this paper, with the substitution of the appropriate values for $\tau_k$.
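As a check on the algebra, the closed form (2) can be compared with direct iteration of the transition rule. The sketch below is illustrative only: it truncates the infinite system at a finite number of states, borrows the fitted values $\tau_k = 0.7 - 0.57(0.83)^k$ from the numerical example so that all the $\tau_k$ are distinct (the closed form divides by their differences), and uses hypothetical function names.

```python
from math import prod

def state_probs_iterated(taus, n):
    """p(A_k, n) for k = 0..n by repeated application of the transition rule."""
    d = [1.0] + [0.0] * n                      # d_0 = {1, 0, 0, ...}
    for _ in range(n):
        nxt = [0.0] * (n + 1)
        for k in range(n + 1):
            nxt[k] += (1.0 - taus[k]) * d[k]   # diagonal of T
            if k + 1 <= n:
                nxt[k + 1] += taus[k] * d[k]   # subdiagonal of T
        d = nxt
    return d

def state_prob_closed(taus, k, n):
    """p(A_k, n) from the general solution (2); requires distinct taus."""
    if k == 0:
        return (1.0 - taus[0]) ** n
    lead = prod(taus[:k])                      # tau_0 * tau_1 * ... * tau_{k-1}
    total = 0.0
    for i in range(k + 1):
        denom = prod(taus[j] - taus[i] for j in range(k + 1) if j != i)
        total += (1.0 - taus[i]) ** n / denom
    return lead * total

n = 8
taus = [0.7 - 0.57 * 0.83**k for k in range(n + 1)]
iterated = state_probs_iterated(taus, n)
closed = [state_prob_closed(taus, k, n) for k in range(5)]
```

The closed form involves sums of terms with alternating signs, so for large $k$ or nearly equal $\tau_k$ the iteration is likely to be the more accurate of the two in floating-point arithmetic.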

Appendix  B 
Recursive Expression for $p_{n+1}$ in Two-Parameter Case

From (20) we obtain the recursive relation

$$p(A_{k+1}, n+1) = \left[1 - (1-p_0)(1-a)^k\right] \frac{1 - (1-a)^{n+1}}{1 - (1-a)^{k+1}}\; p(A_k, n).$$

Rearranging and summing, we have

$$\sum_{k=0}^{n} \frac{1-(1-a)^{k+1}}{1-(1-a)^{n+1}}\; p(A_{k+1}, n+1) = \sum_{k=0}^{n} \left[1 - (1-p_0)(1-a)^k\right] p(A_k, n).$$

The right side of this equation is, from (5) and (18), $p_{n+1}$. The left side can be rewritten

$$\sum_{k=1}^{n+1} \left[\frac{1-(1-a)^{k}}{1-(1-a)^{n+1}}\right] p(A_k, n+1),$$

which becomes on trial $n$ (with $n \ge 1$),

$$\sum_{k=1}^{n} \left[\frac{1-(1-a)^{k}}{1-(1-a)^{n}}\; p(A_k, n)\right] = p_n.$$

We now have, by adding and subtracting $p(A_0, n)$,

$$\frac{1}{1-(1-a)^n}\left[\sum_{k=0}^{n} p(A_k, n) - \sum_{k=0}^{n} (1-a)^k\, p(A_k, n)\right] = p_n,$$

$$1 - \sum_{k=0}^{n} (1-a)^k\, p(A_k, n) = \left[1 - (1-a)^n\right] p_n.$$

Now we know that

$$p_{n+1} = 1 - (1-p_0) \sum_{k=0}^{n} (1-a)^k\, p(A_k, n);$$

and so we obtain

$$p_{n+1} = 1 - (1-p_0)\left\{1 - \left[1 - (1-a)^n\right] p_n\right\}.$$

Rearranging terms gives

$$p_{n+1} = p_0 + (1-p_0)\left[1 - (1-a)^n\right] p_n, \tag{21}$$

which is the desired result.

From this result (15) is obtained directly by equating $p_0$ and $a$.
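The recursion (21) is easy to check against a direct computation. The sketch below assumes the two-parameter recall probabilities $\tau_k = 1 - (1-p_0)(1-a)^k$, which is the form implied by the sum used in the derivation above; the parameter values are those of the ledger illustration ($p_0 = 0.1$, $a = 0.05$), and the function names are hypothetical.

```python
def recall_direct(p0, a, n_trials):
    """p_1..p_n from the state probabilities, tau_k = 1 - (1-p0)(1-a)**k."""
    d = [1.0]                                   # every word starts in A_0
    out = []
    for _ in range(n_trials):
        taus = [1.0 - (1.0 - p0) * (1.0 - a) ** k for k in range(len(d))]
        out.append(sum(t * dk for t, dk in zip(taus, d)))
        nxt = [0.0] * (len(d) + 1)
        for k, dk in enumerate(d):
            nxt[k] += (1.0 - taus[k]) * dk      # no recall: remain in A_k
            nxt[k + 1] += taus[k] * dk          # recall: advance to A_{k+1}
        d = nxt
    return out

def recall_recursive(p0, a, n_trials):
    """p_1..p_n from (21): p_{n+1} = p0 + (1-p0) [1 - (1-a)**n] p_n."""
    out = [p0]                                  # p_1 = p_0
    for n in range(1, n_trials):
        out.append(p0 + (1.0 - p0) * (1.0 - (1.0 - a) ** n) * out[-1])
    return out

direct = recall_direct(0.1, 0.05, 25)
recursive = recall_recursive(0.1, 0.05, 25)
gap = max(abs(x - y) for x, y in zip(direct, recursive))
```

The recursion needs only one multiplication and one power per trial, whereas the direct computation carries the whole vector of state probabilities along.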

Appendix  C 

List  of  Symbols  and  Their  Meanings 

$a$              parameter.
$A_k$            state that a word is in after being recalled $k$ times.
$\alpha$         parameter.
$d_n$            infinite column vector, having $p(A_k, n)$ as its elements.
$D$              infinite diagonal matrix similar to $T$.
$k$              number of times a word has been recalled.
$m$              asymptotic value of $\tau_k$ and $p_n$.
$n$              number of trial.
$N$              total number of test words to be learned.
$N_{k,n}$        number of words in state $A_k$ on trial $n$.
$p_0$            probability of recalling a word in state $A_0$.
$p(A_k, n)$      probability that a word will be in state $A_k$ on trial $n$.
$r_n$            observed recall score on trial $n$; estimate of $p_n$.
$p_n$            probability of recall on trial $n$.
$S_{ij}$         elements of $S$.
$S'_{ij}$        elements of $S^{-1}$.
$S$              infinite matrix used to transform $T$ into a similar diagonal matrix.
$t_k$            estimate of $\tau_k$.
$t_{k,n}$        observed fraction of words in state $A_k$ that are recalled on trial $n$.
$\tau_k$         probability of recalling a word in state $A_k$.
$T$              infinite matrix of transition probabilities $\tau_k$.
$\operatorname{Var}(r_n)$   variance of the estimate of $p_n$.
$X_{i,k,n+1}$    random variable equal to 1 or 0.
ULTIMATE CHOICE BETWEEN TWO ATTRACTIVE GOALS:
PREDICTIONS FROM A MODEL*
A mathematical model for two-choice behavior in situations where both choices are desirable is discussed. According to the model, one or the other choice is ultimately preferred, and a functional equation is given for the fraction of the population ultimately preferring a given choice. The solution depends upon the learning rates and upon the initial probabilities of the choices. Several techniques for approximating the solution of this functional equation are described. One of these leads to an explicit formula that gives good accuracy. This solution can be generalized to the two-armed bandit problem with partial reinforcement in each arm, or the equivalent T-maze problem. Another suggests good ways to program the calculations for a high-speed computer.

The immobility of Buridan's ass, who starved to death between two haystacks, has always seemed unreasonable. No doubt the story was invented to mock an equilibrium theory of behavior. One expects that any such equilibrium in approach-approach situations will be unstable: one of the attractive goals will be chosen. In this paper some properties that flow from a mathematical model for repetitive approach-approach behavior are discussed. In the model for behavior in these choice situations, an organism initially shifts its choices from one to another, but after a while settles upon a single choice.

Thus in the early part of the learning the theoretical organism may give some expression to the notion of an equilibrium by making different choices on different trials, but eventually even this behavior vanishes for the single organism. On the other hand, some organisms may ultimately choose one goal and others another, so that a notion of equilibrium or balance could be recaptured across a population of organisms. The quantitative aspects of a model for such behavior are investigated. The model employed is one discussed by Bush and Mosteller [1].

A  simple  situation  will  be  discussed  first,  then  the  mathematical  problem 
encountered  there  will  be  related  to  the  more  complicated  two-armed  bandit 
problem  with  partial  reinforcement  on  each  arm.  Suppose  that  on  each 
trial  of  an  infinite  sequence  an  organism  may  respond  (or  choose)  in  one 
of  two  ways.  For  purposes  of  exposition,  specify  the  ways  as  R  and  L  (for 
right  and  left,  say),  so  that  for  concreteness  one  can  think  of  a  rat  choosing 
the  left-hand  or  right-hand  side  in  a  T-maze,  or  a  person  choosing  the  left- 
hand  or  the  right-hand  button  in  a  two-armed  bandit  situation.  However, 
R  and  L  are  intended  to  stand  for  a  general  pair  of  attractive  objects  or 
responses,  mutually  exclusive  and  exhaustive,  which  lead  to  attractive 
goals. 

Suppose that on a given trial the probability of choosing R is $p$, and that of choosing L is $1 - p$, where as usual $0 \le p \le 1$. If R is chosen, then the probability of choosing R next time is increased to $\alpha_1 p + 1 - \alpha_1$, but if L is chosen the probability of choosing R next is reduced to $\alpha_2 p$, where $0 \le \alpha_1 \le 1$, $0 \le \alpha_2 \le 1$. The point is that when a reinforcing choice is made, that choice has an increased probability of being chosen next time, and both R and L are regarded as reinforcing. The asymmetry in the formulas comes from the fact that the notation uses the probability of choosing R, and not the probability of choosing the particular side chosen on each trial. The operators used to change the probabilities are discussed by Bush and Mosteller ([1], p. 154 ff.).

Suppose the organism continues making the choices and that his probabilities are adjusted after every trial according to the rules just given. Then it can be shown that sooner or later the organism stops making one of the choices and thereafter chooses only the other. An extreme example occurs if both $\alpha_1$ and $\alpha_2$ are zero; then the organism chooses forever what he chooses first (one-trial learning).

One mathematical problem is to discover the probability that the organism eventually chooses R rather than L all the time. If he does choose R all the time, then he is said to be "ultimately attracted by R," or R is "ultimately attracting." The desired probability should be expressible as a function of the initial probability $p$ and of the attractiveness coefficients $\alpha_1$ and $\alpha_2$ (the smaller an $\alpha$, the more attractive the side). For convenience, this will be called the simple approach-approach problem, in contrast to the more complicated partial reinforcement problems.
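The process just described is straightforward to simulate, which gives a rough Monte Carlo estimate of the desired probability. The sketch below is only an illustration: the parameter values, the number of trials, the number of simulated organisms, the threshold used to decide which side has won, and the function names are all arbitrary choices and are not taken from the paper.

```python
import random

def run_organism(p, alpha1, alpha2, n_trials, rng):
    """Apply the two operators for n_trials; return the final probability of R."""
    for _ in range(n_trials):
        if rng.random() < p:
            p = alpha1 * p + 1.0 - alpha1    # R chosen: p moves toward 1
        else:
            p = alpha2 * p                   # L chosen: p moves toward 0
    return p

def estimate_attraction(p, alpha1, alpha2, n_organisms=2000, n_trials=500, seed=0):
    """Fraction of simulated organisms whose probability of R ends above 1/2."""
    rng = random.Random(seed)
    finals = [run_organism(p, alpha1, alpha2, n_trials, rng)
              for _ in range(n_organisms)]
    return sum(f > 0.5 for f in finals) / n_organisms, finals

fraction_r, finals = estimate_attraction(p=0.3, alpha1=0.8, alpha2=0.8)
```

With equal coefficients the estimate should come out near the initial probability, in line with special case (a) discussed later in the paper, and every final probability should lie very close to 0 or 1.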

Consider  now  as  an  example  a  T-maze  experiment  with  paradise  fish 
described  by  Bush  and  Wilson  [2].  On  each  trial  of  this  experiment  a  fish 
started at one end of a tank and swam to the other, where the left or right
side  could  be  chosen.  When  the  right-hand  side  was  chosen,  the  fish  was 
rewarded  on  75  percent  of  the  trials.  When  the  left  side  was  chosen,  the  fish 
was  rewarded  on  25  percent  of  the  trials.  The  operation  was  to  place  the  reward 
on  one  side  or  the  other  every  time.  In  one  group  a  fish  was  able  to  see  the 
reward  through  a  transparent  divider  when  he  chose  the  unrewarded  side. 
In  the  other  group  an  opaque  divider  was  used.  The  data  from  these  groups 
showed  that  the  fish  tended  to  stabilize  on  one  side  or  the  other. 

Within the framework of the operators described earlier in this paper, if $p$ is the probability of choosing the right-hand side on a given trial, and if the right-hand side is chosen and rewarded, the new probability of choosing the right-hand side might be expressed as $\alpha_1 p + 1 - \alpha_1$. If the left-hand side were chosen and rewarded, the new probability of choosing the right might be reduced to $\alpha_2 p$. The parallel with the previous descriptions is very close.

But  suppose  the  side  chosen  is  not  rewarded.  Then,  essentially,  three 
possibilities  exist. 

(a)  The  side  chosen  is  more  likely  to  be  chosen  than  it  was  before.  The 
explanation  might  be,  for  example,  that  the  organism  is  building  up  a  habit 
pattern,  or  that  he  is  secondarily  reinforced  for  being  in  a  place  that  earlier 
was  rewarding. 

(b)  The  side  chosen  is  less  likely  to  be  chosen  than  before.  The  ex- 
planation might  be,  for  example,  that  information  has  been  received  that 
this  side  is  not  paying  off. 

Whatever the explanation may be, the models corresponding to (a) and to (b) make quite different predictions. The model for (a) says that the probability associated with the side chosen is always increased whether reward is given or not. This ultimately implies, for the operators described here, that one side is chosen every time, that is, that eventually the organism stabilizes on one side. On the other hand, the model for (b) would imply that the organism does not stabilize. To see this, suppose that an organism is certain ($p = 1$) to choose the right-hand side; that is, he has stabilized on the right. Then because of partial reinforcement the organism will experience some nonrewarded trials on the right-hand side. These will reduce the probability of choosing the right-hand side, and so the left-hand side will be chosen sometimes. A similar argument shows that the organism cannot stabilize on the left. Thus under partial reinforcement, a model for assumption (b) would typically have asymptotic instability. A subject does not become attracted by one side or the other, nor does he finally acquire a fixed probability $p$ of choosing R. Instead, his value of $p$ drifts up and down, though in a stochastically stable way. Thus model (a) has attracting and absorbing barriers, while model (b) has reflecting barriers.

(c)  The  probability  is  unchanged  by  a  nonreward — then  everything 
depends  upon  the  rewarded  trials. 
In the experiment with paradise fish the data suggest model (a). In this paper we shall deal with the type (a) model. On the basis of the model, we would like to know (in terms of the learning rates, the initial probabilities, and the probabilities of reward on the two sides) what fraction of the organisms will stabilize on a given side.

Because  the  numerical  problem  has  turned  out  to  be  rather  trouble- 
some, and  because  the  general  problem  has  some  interest  as  shown  by  previous 
work,  we  will  sketch  various  solutions  that  have  been  tried.  Each  of  them 
is  time-consuming  in  its  development  and  testing,  so  a  research  worker  will 
want  to  know  what  ground  has  already  been  plowed. 

Previous  Work 

To facilitate discussion of previous work on the simple approach-approach problem, a functional equation for the probability that an organism is ultimately attracted to R will be derived. Let $f(p_1; \alpha_1, \alpha_2)$ be the probability that an infinite sequence of trials ends in choices of R. Here, $p_1$ is the initial probability of choosing R. The transition rules are: if $p_n$ is the probability of R on trial $n$, then the probability of R on the next trial is

$$
p_{n+1} =
\begin{cases}
\alpha_1 p_n + 1 - \alpha_1, & \text{if R is chosen on trial } n, \\
\alpha_2 p_n, & \text{if L is chosen on trial } n.
\end{cases}
\tag{1}
$$

In the sequel there is usually no advantage in referring to the trial number associated with $p$, so the subscript on $p_1$ is dropped and $p$ stands for the initial probability. Similarly it is always to be understood that the desired function $f$ depends upon $\alpha_1$ and $\alpha_2$; so except when the full notation is needed, the notation $f(p)$ will be used.

The quantity $f(p)$ may be composed of two parts, the parts corresponding to the choice of R or of L on the initial trial. Assume that each member of a large population has the same initial probability $p$ of choosing R and is faced with the same simple approach-approach problem. Then, on the first choice the fraction $p$ of the individuals choose R, and the new probability of R is $\alpha_1 p + 1 - \alpha_1$ for any member of this group. This means that in this group, the probability of being ultimately attracted by R is $f(\alpha_1 p + 1 - \alpha_1)$. Consequently this group contributes the portion $p\,f(\alpha_1 p + 1 - \alpha_1)$ to $f(p)$. In the same manner those organisms choosing L first contribute $(1-p)\,f(\alpha_2 p)$ to $f(p)$. Thus one derives the basic functional equation for the simple approach-approach problem:

$$f(p) = p\,f(\alpha_1 p + 1 - \alpha_1) + (1-p)\,f(\alpha_2 p). \tag{2}$$

The boundary conditions are $f(0) = 0$ and $f(1) = 1$. These conditions hold because if $p = 0$, then L occurs, and the new probability for R is $\alpha_2 \cdot 0 = 0$. Therefore L is always chosen. Similarly, if $p = 1$, then R occurs, and the new probability for R is $\alpha_1 \cdot 1 + 1 - \alpha_1 = 1$. Therefore R is always chosen. Thus $f(0) = 0$ and $f(1) = 1$. These conditions for the function are needed because without them (2) only determines $f$ to within a linear transformation. Thus if a certain $f$ satisfies (2), direct substitution shows that $Af + B$ also satisfies it ($A$ and $B$ are constants).
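The direct substitution takes one line. Writing $h = Af + B$ and using only equation (2) for $f$ (the symbol $h$ is introduced here for convenience and is not the paper's notation):

```latex
\begin{aligned}
p\,h(\alpha_1 p + 1 - \alpha_1) + (1-p)\,h(\alpha_2 p)
  &= A\bigl[p\,f(\alpha_1 p + 1 - \alpha_1) + (1-p)\,f(\alpha_2 p)\bigr] + B\bigl[p + (1-p)\bigr] \\
  &= A\,f(p) + B = h(p).
\end{aligned}
```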

Equation (2) could have had four parts if we related the desired probability to the four terms occurring after two trials, or more generally $2^n$ terms after $n$ trials. These equations are all equivalent, but they can all be derived by successive applications of (2) to the $f$'s appearing on the right-hand side.

The  properties  of  f{p)  have  been  studied  before  by  Bellman  and  by 
Shapiro  ([3],  Parts  II  and  III),  and  by  Karlin  [4]  (c.f.  [1],  p.  163-4).  Since 
not  all  of  their  results  are  readily  accessible,  those  properties  of  f(p)  especially 
useful  here  are  given  below. 

i. Nature of the solution. Equation (2) has a unique, monotone, analytic solution once the boundary conditions are given. With our boundary conditions the solution is convex for $\alpha_1 > \alpha_2$, concave for $\alpha_1 < \alpha_2$. The monotonicity is consistent with the probability interpretation given by the learning model: for given $\alpha_1$ and $\alpha_2$, the larger the probability of choosing R initially, the more likely that R is ultimately attracting.

ii.  Solutions  under  special  conditions.  In  what  follows,  suppose  the 
relevant  boundary  conditions  /(O)  =  0  and  /(I)  =  1  to  hold.  The  special 
conditions  have  to  do  with  the  values  assumed  by  one  or  both  of  the  a's. 

(a) $\alpha_1 = \alpha_2 \neq 1$. The solution is $f(p) = p$, as implied by the fact that
$f(p)$ is both convex and concave and by the boundary conditions.

(b) $\alpha_1 = \alpha_2 = 1$. The function $f$ is not defined in our problem unless
p = 1 or 0, because the probability of R never changes and no attraction
occurs.

(c) $\alpha_1 = 1$, $\alpha_2 \neq 1$. The occurrence of R leaves the probability of R
unchanged because $\alpha_1 p + 1 - \alpha_1 = p$, so the process can only move toward
choosing more L's unless p = 1. Thus $f(p; 1, \alpha_2) = 0$ for $\alpha_2 \neq 1$, $p \neq 1$, and
$f(1; 1, \alpha_2) = 1$.

(d) $\alpha_2 = 1$, $\alpha_1 \neq 1$. Similarly $f(p; \alpha_1, 1) = 1$ for $\alpha_1 \neq 1$, $p \neq 0$,
and $f(0; \alpha_1, 1) = 0$.

(e) $\alpha_1 = 0$. Here, the only way to be ultimately attracted to L is always
to choose L. The probability of the latter behavior is

(3)  $$g(p, \alpha_2) = (1 - p)(1 - \alpha_2 p)(1 - \alpha_2^2 p) \cdots = \prod_{i=0}^{\infty} (1 - \alpha_2^i p).$$

Therefore the probability of ultimate attraction by R is

(4)  $$f(p; 0, \alpha_2) = 1 - g(p, \alpha_2).$$

(f) $\alpha_2 = 0$. Here to be ultimately attracted by R is never to choose L.
In this case

(5)
$$
\begin{aligned}
f(p; \alpha_1, 0) &= p[\alpha_1 p + 1 - \alpha_1][\alpha_1(\alpha_1 p + 1 - \alpha_1) + 1 - \alpha_1] \cdots \\
&= p[\alpha_1 p + 1 - \alpha_1][\alpha_1^2 p + 1 - \alpha_1^2] \cdots \\
&= [1 - (1 - p)][1 - \alpha_1(1 - p)][1 - \alpha_1^2(1 - p)] \cdots \\
&= \prod_{i=0}^{\infty} [1 - \alpha_1^i(1 - p)].
\end{aligned}
$$

In the second step above, note that if R occurs on the first n trials, the
probability of R is $\alpha_1^n p + 1 - \alpha_1^n$ (proved in [1], p. 59).
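The two infinite products in cases (e) and (f) are easy to evaluate numerically. The following is a minimal sketch, not part of the original paper: the function names and the truncation at 200 factors are assumptions made here for illustration. It also checks that each product satisfies the functional equation (2) for its special case.

```python
def attraction_to_L(p, alpha2, terms=200):
    """g(p, alpha2): probability of choosing L on every trial when alpha1 = 0.

    The infinite product is truncated after `terms` factors (an assumption;
    the neglected factors differ from 1 by at most alpha2**terms).
    """
    product = 1.0
    power = 1.0  # holds alpha2**i
    for _ in range(terms):
        product *= 1.0 - power * p
        power *= alpha2
    return product


def f_alpha1_zero(p, alpha2, terms=200):
    """f(p; 0, alpha2) = 1 - g(p, alpha2)."""
    return 1.0 - attraction_to_L(p, alpha2, terms)


def f_alpha2_zero(p, alpha1, terms=200):
    """f(p; alpha1, 0): probability of choosing R on every trial."""
    product = 1.0
    power = 1.0  # holds alpha1**i
    for _ in range(terms):
        product *= 1.0 - power * (1.0 - p)
        power *= alpha1
    return product


# Check against the functional equation (2) at one interior point.
p, alpha = 0.3, 0.8
# alpha1 = 0: an R response sends p to 1, and f(1) = 1.
rhs_e = p * 1.0 + (1.0 - p) * f_alpha1_zero(alpha * p, alpha)
# alpha2 = 0: an L response sends p to 0, and f(0) = 0.
rhs_f = p * f_alpha2_zero(alpha * p + 1.0 - alpha, alpha)
```

With these definitions `f_alpha1_zero(p, alpha)` agrees with `rhs_e`, and `f_alpha2_zero(p, alpha)` with `rhs_f`, up to rounding error.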

iii. Iterative properties. Any continuous initial approximation to $f(p)$
can be iterated successively to obtain in the limit the function $f(p)$. That is,
suppose $f_0(p)$ is a first guess at the function $f(p)$, then a better approximation
is given by the first iterate

$$f_1(p) = p f_0(\alpha_1 p + 1 - \alpha_1) + (1 - p) f_0(\alpha_2 p).$$

For example if $f_0(p) = p$, then $f_1(p) = p + (\alpha_2 - \alpha_1)p(1 - p)$.
More generally, the (n + 1)st iterate is given by

$$f_{n+1}(p) = p f_n(\alpha_1 p + 1 - \alpha_1) + (1 - p) f_n(\alpha_2 p).$$

Certain initial approximations lead to a monotonic sequence of iterates.

(a) If $f_0(p) = p$, the successive iterates monotonically increase toward
$f(p)$ if $\alpha_2 > \alpha_1$, monotonically decrease toward $f(p)$ if $\alpha_2 < \alpha_1$.

(b) If for the beginning approximation

(6)
$$
f_0(p) =
\begin{cases}
\prod_{i=0}^{\infty} [1 - \alpha_1^i(1 - p)], & \text{for } \alpha_2 < \alpha_1, \\
1 - \prod_{i=0}^{\infty} [1 - \alpha_2^i p], & \text{for } \alpha_2 > \alpha_1,
\end{cases}
$$

the iterates increase (decrease) monotonically to the function. These results
provide two sequences of bounds for $f(p)$ when the approximations mentioned
in (a) and (b) are used.
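The iteration can be sketched directly as repeated function composition. This is an illustrative sketch rather than the authors' procedure: the helper names are hypothetical, and evaluating the nth iterate at a single point this way costs about $2^n$ function calls, so it is only practical for small n.

```python
def iterate(f_prev, alpha1, alpha2):
    """Return the next iterate of equation (2) built from f_prev."""
    def f_next(p):
        return (p * f_prev(alpha1 * p + 1.0 - alpha1)
                + (1.0 - p) * f_prev(alpha2 * p))
    return f_next


def nth_iterate(n, alpha1, alpha2, f0=lambda p: p):
    """Apply the iteration n times, starting from f0 (default f0(p) = p)."""
    f = f0
    for _ in range(n):
        f = iterate(f, alpha1, alpha2)
    return f


# Standard example of the paper: alpha1 = 0.75, alpha2 = 0.80, evaluated at p = 0.5.
alpha1, alpha2 = 0.75, 0.80
values = [nth_iterate(n, alpha1, alpha2)(0.5) for n in range(8)]
```

Because $\alpha_2 > \alpha_1$ here, the entries of `values` increase with n, starting from 0.5 and 0.5125, in line with property (a).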

The iteration procedure converges geometrically, that is, after n iterations
one can be sure that the nth iterate $f_n(p)$ deviates from the correct
answer $f(p)$ by no more than $A\rho^n$, where $A > 0$ and $0 < \rho < 1$. Though
geometric convergence sounds speedy, if $\rho$ were near 1, say 0.96, it would
take more than 50 iterations to assure being within $0.1A$. The details needed
for the calculation of $A$ and $\rho$ will not be provided.
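The iteration count quoted for a ratio of 0.96 follows from solving $\rho^n \le 0.1$ for n. A small sketch of that arithmetic (the function name is an assumption introduced here):

```python
import math


def iterations_needed(rho, fraction):
    """Smallest n with rho**n <= fraction, for 0 < rho < 1 and 0 < fraction < 1."""
    return math.ceil(math.log(fraction) / math.log(rho))


n = iterations_needed(0.96, 0.1)
```

For a ratio of 0.96 this gives n = 57, consistent with the statement that more than 50 iterations are required.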

These important results provide a starting point for studying the function
$f(p)$, but they do not yield numbers or expressions whose values are
close to the true ones. In the remainder of this paper, several techniques for
approximating $f(p)$ are provided.

A  method  designed  for  high-speed  calculation  will  be  considered  first, 
then  an  excellent  approximation  obtained  from  a  differential  equation  will 
be  considered,  and  then  that  result  will  be  extended  to  the  two-armed  bandit 
problem.  Finally,  brief  mention  of  some  other  methods  of  approximating 
this  functional  equation  will  be  given. 

Approximation by Simultaneous Equations

Consider a grid of numbers $0\ (= p_0), p_1, p_2, \ldots, p_n, 1\ (= p_{n+1})$ in
the unit interval, and write the functional equation (2) as it applies to each
of these values of the independent variable. (Lest confusion with earlier
notation develop, note that $p_i$ still refers to probabilities, but the subscripts
no longer correspond to trials as they did in earlier sections.) Then one has
the set of equations

(7)
$$
\begin{aligned}
f(0) &= 0 + f(0), \\
f(p_1) &= p_1 f(\alpha_1 p_1 + 1 - \alpha_1) + (1 - p_1) f(\alpha_2 p_1), \\
f(p_2) &= p_2 f(\alpha_1 p_2 + 1 - \alpha_1) + (1 - p_2) f(\alpha_2 p_2), \\
&\;\;\vdots \\
f(p_n) &= p_n f(\alpha_1 p_n + 1 - \alpha_1) + (1 - p_n) f(\alpha_2 p_n), \\
f(1) &= f(1) + 0.
\end{aligned}
$$

The first and last members of this set of equations are, of course, tautologies;
there are only n nontrivial equations.

The right-hand sides of the n nontrivial equations of the set (7) each
involve the values of $f(p)$ at points that do not ordinarily coincide with any
of the chosen grid points. However, by using an interpolation formula, both
$f(\alpha_1 p_i + 1 - \alpha_1)$ and $f(\alpha_2 p_i)$, $i = 1, 2, \ldots, n$, may be approximated by
linear combinations of the values of $f(p)$ at two or more consecutive grid
points $p_i, p_{i+1}, \ldots$. The number of grid points required depends upon
whether one uses linear interpolation (two grid points), interpolation with
second differences (three points), third differences (four points), and so forth.

Whatever the number of points may be, each equation of the set (7)
can be replaced by an approximate equality involving as unknowns just the
values of $f(p)$ at several predetermined grid points, and these unknowns
occur only linearly. Thus a system of n linear equations is obtained,
approximately satisfied by the n unknown quantities $f(p_1), f(p_2), \ldots, f(p_n)$. The
idea of deriving a system of linear equations whose roots approximate $f(p_i)$,
$i = 1, 2, \ldots, n$, was first suggested to us by J. Arthur Greenwood in an
unpublished memorandum, in which linear interpolation was used to
approximate $f(\alpha_1 p_i + 1 - \alpha_1)$ and $f(\alpha_2 p_i)$.

In this and in the following sections a standard numerical example in
which $\alpha_1 = .75$, $\alpha_2 = .80$ is used to illustrate the various methods. This
example has the advantage of being easily displayed; further, numbers are
fairly easy to compute from it. It has the disadvantage of being relatively
easy to fit, so the reader should not be misled into thinking that the precision
attained for it is always obtainable.

Example. The method just described is illustrated for a grid of five
equally spaced points, using the standard example, $\alpha_1 = 0.75$, $\alpha_2 = 0.80$.
Here, the functional equation is

(8)  $$f(p) = p f(0.75p + 0.25) + (1 - p) f(0.80p).$$

Taking $p_1 = 0.25$, $p_2 = 0.50$, $p_3 = 0.75$ and writing $f(p_i) = f_i$, for
short, in accordance with equations (7),

(9)
$$
\begin{aligned}
f_1 &= 0.25 f(0.4375) + 0.75 f(0.20), \\
f_2 &= 0.50 f(0.6250) + 0.50 f(0.40), \\
f_3 &= 0.75 f(0.8125) + 0.25 f(0.60).
\end{aligned}
$$

First, linear interpolation will be used to approximate $f(0.4375)$, $f(0.20)$,
$f(0.6250)$, etc., by means of linear combinations of the five $f$'s: $f_0\ (= 0)$,
$f_1$, $f_2$, $f_3$ and $f_4\ (= 1)$. Thus,

$$
\begin{aligned}
f(0.4375) &\approx \frac{0.5000 - 0.4375}{0.2500} f_1 + \frac{0.4375 - 0.2500}{0.2500} f_2 = 0.25 f_1 + 0.75 f_2, \\
f(0.20) &\approx \frac{0.25 - 0.20}{0.25} f_0 + \frac{0.20 - 0}{0.25} f_1 = 0.80 f_1,
\end{aligned}
$$

and, similarly,

$$
\begin{aligned}
f(0.6250) &\approx 0.50 f_2 + 0.50 f_3, \\
f(0.40) &\approx 0.40 f_1 + 0.60 f_2, \\
f(0.8125) &\approx 0.75 f_3 + 0.25 f_4 = 0.75 f_3 + 0.25, \\
f(0.60) &\approx 0.60 f_2 + 0.40 f_3.
\end{aligned}
$$

Substituting these approximate expressions for the several functional values
in the right-hand sides of (9) and collecting all terms involving the unknowns
into the left-hand sides, one obtains

(10)
$$
\begin{aligned}
0.3375 f_1 - 0.1875 f_2 &\approx 0, \\
-0.2000 f_1 + 0.4500 f_2 - 0.2500 f_3 &\approx 0, \\
-0.1500 f_2 + 0.3375 f_3 &\approx 0.1875.
\end{aligned}
$$



Replacing the $\approx$ by $=$ in the set of approximations (10) and solving
the resulting equations, one obtains the following approximations to $f_i$.
(The best available values are also shown for comparison.)

  p_i     f_i (approx.)   best values
  0.25       0.3385         0.4495
  0.50       0.6093         0.7286
  0.75       0.8276         0.8987

The  agreement  with  the  best  available  values  is  only  fair. 
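The construction just illustrated can be sketched for an arbitrary grid. The code below is a minimal illustration, not the authors' program: all function names are hypothetical, and a small Gaussian elimination routine is included only to keep the example self-contained. It builds one linear equation per interior grid point, using linear interpolation for the two off-grid values of $f$.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def linear_weights(x, grid):
    """Return {grid index: weight} for linear interpolation of f at x."""
    for j in range(len(grid) - 1):
        if grid[j] <= x <= grid[j + 1]:
            t = (x - grid[j]) / (grid[j + 1] - grid[j])
            return {j: 1.0 - t, j + 1: t}
    raise ValueError("x lies outside the grid")


def grid_approximation(alpha1, alpha2, grid):
    """Approximate f at the interior grid points, with f(0) = 0 and f(1) = 1."""
    last = len(grid) - 1
    n = last - 1  # number of unknowns f_1 .. f_n
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for i in range(1, last):
        p = grid[i]
        a[i - 1][i - 1] += 1.0
        for coeff, x in ((p, alpha1 * p + 1.0 - alpha1), (1.0 - p, alpha2 * p)):
            for j, w in linear_weights(x, grid).items():
                if j == last:
                    b[i - 1] += coeff * w         # known value f(1) = 1
                elif j != 0:
                    a[i - 1][j - 1] -= coeff * w  # unknown f_j; f(0) = 0 drops out
    return solve_linear_system(a, b)


f1, f2, f3 = grid_approximation(0.75, 0.80, [0.0, 0.25, 0.5, 0.75, 1.0])
```

For the five-point example this system has the coefficients of (10), and solving it in floating point gives roughly 0.3388, 0.6098, and 0.8266, close to the tabled hand-computed approximations.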

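Now use second-order interpolation for approximating the non-grid-point
values of $f(p)$ that occur in the right-hand sides of (9). The general
formula (with equally spaced grid points) is

(11)
$$
f(x_i + \epsilon) \approx \frac{1}{2}\left(1 - \frac{\epsilon}{\Delta x}\right)\left(2 - \frac{\epsilon}{\Delta x}\right) f_i
+ \frac{\epsilon}{\Delta x}\left(2 - \frac{\epsilon}{\Delta x}\right) f_{i+1}
- \frac{1}{2}\,\frac{\epsilon}{\Delta x}\left(1 - \frac{\epsilon}{\Delta x}\right) f_{i+2},
$$

where $\Delta x = x_{i+1} - x_i$. Note that (11) gives the interpolated value as a weighted
average of the three adjacent tabled values instead of using differences.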

Applying (11) to the problem at hand and substituting these approximate
expressions into the right-hand sides of (9), one obtains the following
system of approximations.

(12)
$$
\begin{aligned}
0.2410 f_1 - 0.1744 f_2 + 0.0235 f_3 &= 0, \\
-0.1400 f_1 + 0.3925 f_2 - 0.3150 f_3 &= -0.0625, \\
-0.0497 f_2 + 0.1369 f_3 &= 0.0872,
\end{aligned}
$$

whose roots yield the following approximations.

  p_i     f_i (approx.)   best values
  .25        0.4279         0.4495
  .50        0.7122         0.7286
  .75        0.8955         0.8987

These  results  are  a  definite  improvement  over  those  obtained  by  linear 
interpolation. 
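As a check on the second-order scheme, the three-point weights can be computed explicitly and used to rebuild the first equation of the second-order system. The sketch below assumes the standard three-point (quadratic) interpolation weights on $f_i$, $f_{i+1}$, $f_{i+2}$ with $\theta = \epsilon/\Delta x$; the function and variable names are introduced here for illustration only.

```python
def second_order_weights(theta):
    """Weights on f_i, f_{i+1}, f_{i+2} for the point x_i + theta * dx."""
    w0 = 0.5 * (1.0 - theta) * (2.0 - theta)
    w1 = theta * (2.0 - theta)
    w2 = -0.5 * theta * (1.0 - theta)
    return w0, w1, w2


# First equation of (9): f1 = 0.25 f(0.4375) + 0.75 f(0.20), grid spacing 0.25.
a0, a1, a2 = second_order_weights((0.20 - 0.00) / 0.25)    # weights on f0, f1, f2
b1, b2, b3 = second_order_weights((0.4375 - 0.25) / 0.25)  # weights on f1, f2, f3

# Collect the unknowns on the left-hand side; f0 = 0 contributes nothing.
coef_f1 = 1.0 - (0.25 * b1 + 0.75 * a1)
coef_f2 = -(0.25 * b2 + 0.75 * a2)
coef_f3 = -(0.25 * b3)
```

The three coefficients come out as 0.2409375, -0.174375, and 0.0234375, which match the first row of the second-order system to the four decimals printed there.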

The above example seems to indicate that a considerable improvement
of the approximation can be expected when higher differences are used in the
interpolation formula for expressing the non-grid-point values of $f(p)$ in
terms of the grid-point values. However, the interpolation formulas become
more and more cumbersome to work with numerically as higher differences
are included. It therefore is pertinent to see how much improvement can
be gained by increasing the number of grid points alone.

TABLE 1

Improvement Obtained by Increasing the Number of
Points in Grid, using Linear Interpolation only;
Entries are Approximate Values of f.

                     Number of points
  p_i       4        5        6        11       21     Best value
  .10     .1347    .1473    .1573    .1864    .1984     .2055
  .25     .3147    .3388    .3557    .4152    .4375     .4495
  .50     .5690    .5955    .6129    .6872    .7133     .7286
  .75     .7845    .8007    .8153    .8666    .8856     .8987
  .90     .9138    .9203    .9261    .9476    .9586     .9658

Using only linear interpolation, approximations from grids of 4, 5, 6,
11, and 21 points were obtained. These points were not equally spaced because
it was hoped that better results would be obtained by spacing the grid so
that the functional values would be approximately equally spaced.
Information needed for such spacing was available from other methods described
later.

Linear interpolations were made in the results for the five grids
described above to obtain approximate values at p = 0.10, 0.25, 0.50, 0.75,
0.90. The numbers are shown in Table 1, together with the best known
values.

Using the difference between the best value and the cell entry for a given
$p_i$ as a measure of error, it will be noted that, very roughly, the error decreases
linearly with the spacing. On the other hand, with a five-point grid, changing
from linear to second-order interpolation gives improvement roughly
equivalent to that given by increasing the number of points to 21 and using linear
interpolation only. Since simultaneous equations are expensive to solve, it
appears that second-order interpolation is well worth the effort, contrary to
usual advice.

Calculations, with the aid of an electronic computer, using 21 grid
points and second-difference interpolation as well as third-difference
interpolation have been made. The results are summarized in Table 2. The results
obtained by using second-order differences are hardly distinguishable from
those using third-order differences, though in a more sharply curved example
they could be more useful. The third-order interpolation column provided
numbers labeled "best values" throughout this paper.

In principle, any desired degree of accuracy can be attained by using
finer grids, but the cost of the calculations increases roughly as the square
of the number of grid points used. A high-speed computer could be
programmed to write its own equations and solve them, but such a program
was not written. If good accuracy is required, the techniques proposed in
this section are recommended.

TABLE 2

Approximations Using Second- and Third-Order Interpolations
With 21 Grid Points and the Approximation
By Second Order Differential Equation

  p_i     f (second-order    f (third-order    f (differential
          interpolation)     interpolation)    equation)
  .00        .00000             .00000            .00000
  .05        .10718             .10778            .10325
  .10        .20495             .20547            .19839
  .15        .29407             .29455            .28601
  .20        .37528             .37564            .36648
  .25        .44919             .44947            .44035
  .30        .51637             .51620            .50774
  .35        .57739             .57736            .56982
  .40        .63279             .63277            .62626
  .45        .68304             .68305            .67764
  .50        .72858             .72859            .72435
  .55        .76981             .76983            .76672
  .60        .80711             .80713            .80504
  .65        .84082             .84084            .83967
  .70        .87126             .87127            .87088
  .75        .89873             .89874            .89891
  .80        .92349             .92350            .92402
  .85        .94578             .94579            .94648
  .90        .96584             .96584            .96648
  .95        .98387             .98387            .98425
 1.00       1.00000            1.00000           1.00000

Approximation  by  a  Differential  Equation 

An essential feature of the simultaneous-equations approximation
discussed in the preceding section was the replacement of non-grid-point
values of $f(p)$ by linear combinations of grid-point values. The continuous
variable analogue of this procedure is the expansion of $f(\alpha_1 p + 1 - \alpha_1)$ and
$f(\alpha_2 p)$ as Taylor's series in the neighborhood of p. This approach will now be
used to derive a differential equation whose solution yields an approximation
to the desired function, $f(p)$.

Rewriting $f(\alpha_1 p + 1 - \alpha_1)$ as $f(p + (1 - \alpha_1)(1 - p))$, and expanding
the latter as a Taylor's series,

(13)
$$
f(p + (1 - \alpha_1)(1 - p)) = f(p) + (1 - \alpha_1)(1 - p) f'(p)
+ \frac{(1 - \alpha_1)^2 (1 - p)^2}{2!} f''(p) + \cdots,
$$

where $f'$ and $f''$ are the first and second derivatives of $f$ with respect to p.
Similarly, expand $f(\alpha_2 p)$ as follows:

(14)
$$
f(\alpha_2 p) = f(p - (1 - \alpha_2)p) = f(p) - (1 - \alpha_2) p f'(p)
+ \frac{(1 - \alpha_2)^2 p^2}{2!} f''(p) - \cdots.
$$

Using only the terms through $f''(p)$ in the two series (13) and (14),
substitute these expressions for the functions in the right-hand side of the
functional equation (2). The result is a differential equation

(15)
$$
\begin{aligned}
f(p) = {} & p\left[f(p) + (1 - \alpha_1)(1 - p) f'(p) + \tfrac{1}{2}(1 - \alpha_1)^2 (1 - p)^2 f''(p)\right] \\
& + (1 - p)\left[f(p) - (1 - \alpha_2) p f'(p) + \tfrac{1}{2}(1 - \alpha_2)^2 p^2 f''(p)\right].
\end{aligned}
$$

By rearranging terms in (15),

$$\tfrac{1}{2}\left\{(1 - \alpha_1)^2 - [(1 - \alpha_1)^2 - (1 - \alpha_2)^2]p\right\} f''(p) + (\alpha_2 - \alpha_1) f'(p) = 0.$$

Hence,

(16)  $$\frac{f''(p)}{f'(p)} = \frac{2(\alpha_2 - \alpha_1)}{[(1 - \alpha_1)^2 - (1 - \alpha_2)^2]p - (1 - \alpha_1)^2},$$

which is integrated to yield

(17)  $$f'(p) = C_1\left[(1 - \alpha_1)^2 - \{(1 - \alpha_1)^2 - (1 - \alpha_2)^2\}p\right]^{1/(1 - \alpha)},$$

where $C_1$ is a constant of integration, and $\alpha$ is an abbreviation for $(\alpha_1 + \alpha_2)/2$.
Integrating both sides of (17),

(18)  $$f(p) = C_1'\left[(1 - \alpha_1)^2 - \{(1 - \alpha_1)^2 - (1 - \alpha_2)^2\}p\right]^{1 + 1/(1 - \alpha)} + C_2,$$

where $C_1'$ and $C_2$ are new constants of integration.

Determining $C_1'$ and $C_2$ from the boundary conditions $f(0) = 0$ and
$f(1) = 1$, the final form of $f(p)$ is

(19)  $$f(p) = \frac{A^B - (A - p)^B}{A^B - (A - 1)^B},$$

where

$$A = \frac{(1 - \alpha_1)^2}{(1 - \alpha_1)^2 - (1 - \alpha_2)^2}$$

and

$$B = \frac{1}{1 - (\alpha_1 + \alpha_2)/2} + 1.$$

Example: Taking $\alpha_1 = 0.75$, $\alpha_2 = 0.80$, as before, calculate the constants
occurring in (19).

$$A = \frac{(0.25)^2}{(0.25)^2 - (0.20)^2} = 2.7778, \qquad B = \frac{1}{1 - 0.775} + 1 = 5.4444.$$

Hence, from (19),

(20)  $$f(p) = \frac{260.42 - (2.7778 - p)^{5.4444}}{260.42 - (1.7778)^{5.4444}}.$$

Using (20), calculate the values of $f(p)$ for p = 0.25, 0.50, and 0.75.

  p_i     f_i (approx.)   best values
  .25        0.4403         0.4495
  .50        0.7244         0.7286
  .75        0.8989         0.8987

Values of $f(p)$ in intervals of 0.05 for p are shown in Table 2, where they may
be compared with the best values so far obtained. Among the various
approximate methods which can be easily carried out with desk calculators, the
differential equation method yields results in closest agreement with those
obtained by the simultaneous equations using 21 grid points and
third-difference interpolation.
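The closed-form approximation is simple to evaluate by machine. The sketch below assumes the form $f(p) = [A^B - (A - p)^B] / [A^B - (A - 1)^B]$ with $A = (1 - \alpha_1)^2 / [(1 - \alpha_1)^2 - (1 - \alpha_2)^2]$ and $B = 1/[1 - (\alpha_1 + \alpha_2)/2] + 1$. The function names are hypothetical, and the code is restricted to $\alpha_1 < \alpha_2$ so that every base raised to the power B is positive.

```python
def de_constants(alpha1, alpha2):
    """Constants A and B of the differential-equation solution (alpha1 < alpha2)."""
    r1 = (1.0 - alpha1) ** 2
    r2 = (1.0 - alpha2) ** 2
    a = r1 / (r1 - r2)
    b = 1.0 / (1.0 - (alpha1 + alpha2) / 2.0) + 1.0
    return a, b


def f_de(p, alpha1, alpha2):
    """Differential-equation approximation to f(p), for 0 <= p <= 1."""
    a, b = de_constants(alpha1, alpha2)
    return (a ** b - (a - p) ** b) / (a ** b - (a - 1.0) ** b)


A, B = de_constants(0.75, 0.80)
approx = [f_de(p, 0.75, 0.80) for p in (0.25, 0.50, 0.75)]
```

For the standard example, `A` and `B` come out near 2.7778 and 5.4444, and the entries of `approx` agree with the differential-equation column of Table 2 to about four decimals.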

The  Two-Armed  Bandit 

The  differential  equation  approach  can  equally  easily  be  applied  to  the 
more  general  model  appropriate  to  the  two-armed  bandit  problem  with 
partial  reinforcement  on  each  arm  (or  the  equivalent  T-maze  experiment). 

Suppose that there are two responses R and L, and whichever occurs
a reward or a nonreward follows. If R occurs, reward follows with
probability $\pi_1$; if L occurs, reward follows with probability $\pi_2$. If p is the
probability of R on a given trial, the new probability for R is as follows.

  New probability                                     Probability
  for R                                               of happening

  $\alpha_1 p + 1 - \alpha_1$    if R and reward occur           $\pi_1 p$
  $\alpha_2 p + 1 - \alpha_2$    if R and nonreward              $(1 - \pi_1)p$
  $\alpha_1 p$                   if L and reward                 $\pi_2(1 - p)$
  $\alpha_2 p$                   if L and nonreward              $(1 - \pi_2)(1 - p)$

These results represent a special case of those presented in ([1], pp. 118, 286)
and discussed briefly on p. 287 in the paragraph following equation (13.22)
in [1].

It has been assumed that reward is equally effective on either side and
that nonreward is also equally effective on either side. It should be recalled
that these transition rules imply that nonreward improves the probability
of choosing a given side, as discussed in the opening section of this paper.

Now in the same way that the basic functional equation (2) for the
simple approach-approach problem was derived, the basic functional
equation for the two-armed bandit problem with partial reinforcement can be
derived. The functional equation for the proportion $f(p)$ of organisms who
eventually learn to make only response R is

(21)
$$
\begin{aligned}
f(p) = {} & p\pi_1 f(\alpha_1 p + 1 - \alpha_1) + p(1 - \pi_1) f(\alpha_2 p + 1 - \alpha_2) \\
& + (1 - p)\pi_2 f(\alpha_1 p) + (1 - p)(1 - \pi_2) f(\alpha_2 p).
\end{aligned}
$$

No generality is lost, and there is some gain in the sequel, if it is assumed
that $\alpha_1 < \alpha_2$ and $\pi_1 > \pi_2$. If $\pi_1 = 1$ and $\pi_2 = 0$, (21) reduces to (2).

Using the approximations (13) and (14) for $f(\alpha_i p + 1 - \alpha_i)$ and $f(\alpha_i p)$,
respectively, (21) can be rewritten, after rearrangement of terms, as

(22)
$$
\left[(\pi_1 - \pi_2)\{(1 - \alpha_1)^2 - (1 - \alpha_2)^2\}p - \{\pi_1(1 - \alpha_1)^2 + (1 - \pi_1)(1 - \alpha_2)^2\}\right] f''(p)
= 2(\pi_1 - \pi_2)(\alpha_2 - \alpha_1) f'(p).
$$

The boundary conditions are $f(0) = 0$, $f(1) = 1$, as before.

Comparing (22) with the corresponding differential equation, (16),
for the simpler model, the general solution of (22) has the form

(23)  $$f(p) = \frac{A^B - (A - p)^B}{A^B - (A - 1)^B},$$

where the constant A is now defined as

$$A = \frac{\pi_1(1 - \alpha_1)^2 + (1 - \pi_1)(1 - \alpha_2)^2}{(\pi_1 - \pi_2)[(1 - \alpha_1)^2 - (1 - \alpha_2)^2]},$$

while

$$B = \frac{1}{1 - (\alpha_1 + \alpha_2)/2} + 1,$$

as before. Note that the expression for A for the simple approach-approach
problem is obtained by substituting $\pi_1 = 1$, $\pi_2 = 0$ in the present A.

The expression for A is undefined when either $\alpha_1 = \alpha_2$ or $\pi_1 = \pi_2$,
hence (23) cannot be used. In each of these cases, however, it can be argued
from first principles that the function sought is $f(p) = p$. This result is also
given by the differential equation (22), which reduces to $f''(p) = 0$ under
these special conditions.

Monte Carlo Calculations for Two-Armed Bandits

Twery and Bush made a series of Monte Carlo calculations on Illiac
of $f(0.50)$ for two-armed bandit experiments with $\pi_1 = 0.75$, $\pi_2 = 0.25$ for
various combinations of $\alpha$-values. The case of $\alpha_1 = 0.90$, $\alpha_2 = 0.95$ will be
used to calculate the value of $f(0.50)$ from (23).
For the stated parameter values,

$$A = \frac{(0.75)(0.10)^2 + (0.25)(0.05)^2}{(0.50)[(0.10)^2 - (0.05)^2]} = 2.1667,$$

$$B = \frac{1}{1 - (1.85/2)} + 1 = 14.3333.$$

Hence, (23) in this case becomes

(24)  $$f(p) = \frac{65015.7 - (2.1667 - p)^{14.3333}}{65006.6}.$$

From this formula,

$$f(0.50) = 0.977,$$

compared with Twery and Bush's result, 0.970.
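The worked example above can be reproduced with a few lines of code. This is a sketch under the assumption that the two-armed bandit solution has the same form as the simple case, $f(p) = [A^B - (A - p)^B] / [A^B - (A - 1)^B]$, with the generalized constant A; the function names are introduced here and do not come from the paper. It assumes $\alpha_1 < \alpha_2$ and $\pi_1 > \pi_2$.

```python
def bandit_constants(alpha1, alpha2, pi1, pi2):
    """Constants A and B for the two-armed bandit approximation."""
    r1 = (1.0 - alpha1) ** 2
    r2 = (1.0 - alpha2) ** 2
    a = (pi1 * r1 + (1.0 - pi1) * r2) / ((pi1 - pi2) * (r1 - r2))
    b = 1.0 / (1.0 - (alpha1 + alpha2) / 2.0) + 1.0
    return a, b


def f_bandit(p, alpha1, alpha2, pi1, pi2):
    """Approximate proportion ultimately attracted to R, starting from p."""
    a, b = bandit_constants(alpha1, alpha2, pi1, pi2)
    return (a ** b - (a - p) ** b) / (a ** b - (a - 1.0) ** b)


A, B = bandit_constants(0.90, 0.95, 0.75, 0.25)
value = f_bandit(0.50, 0.90, 0.95, 0.75, 0.25)

# With pi1 = 1 and pi2 = 0 the simple approach-approach case is recovered.
simple = f_bandit(0.50, 0.75, 0.80, 1.0, 0.0)
```

Here `value` rounds to 0.977, and `simple` reproduces the differential-equation entry of Table 2 at p = .50.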

The values of $f(0.50)$, calculated from (23) for the various combinations
of $\alpha$ values used by Twery and Bush, are shown in Table 3 along with
the Monte Carlo result obtained by these authors. Their numbers were
obtained in a pseudo-experiment in which 100 sequences of 800 trials each
were run with random numbers. The entry itself is the average value of p
for the 100 sequences at trial 800. Thus it has some random variation and
is pre-asymptotic to the extent that 800 trials is not an infinite number.
The agreement between the Monte Carlo results and the differential
equation is quite encouraging for the use of the differential-equation
method; it is surprisingly close, considering that only 100 sequences were used
and that the differential equation is only an approximation. On the other
hand, both learning parameters are near unity in these examples; in that
neighborhood the differential equation should be quite a good approximation.

TABLE 3

Comparison of Differential Equation Results (first entry)
With Those of Twery and Bush (second entry)
Obtained from the Mean Probability Level of 100 Sequences
At the 800th Trial for Various $\alpha_1$, $\alpha_2$, for p = 0.5
And $\pi_1 = 1 - \pi_2 = 0.75$

                               alpha_2
  alpha_1    .91     .92     .93     .94     .95     .96     .97

   .90      .634    .770    .878    .944    .977
            .610    .780    .880    .960    .970

   .91              .665    .820    .923    .972     -       -
                    .700    .840    .900    .970

   .92                      .707    .877    .962    .990
                            .669    .880    .980    .997

   .93                              .763    .933    .987    .988
                                    .834    .960    .990   1.000

   .94                                      .835    .975
                                            .787    .990

   .95                                              .916    .996
                                                    .826    .999

   .96                                                      .979
                                                            .932
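A pseudo-experiment of the kind described can be sketched as follows. This is not Twery and Bush's program: the function name, the use of Python's `random` module, the seed, and the convention of reading p after the last update are all assumptions made for illustration. The transition rules are those of the two-armed bandit model, with $\alpha_1$ applied after reward and $\alpha_2$ after nonreward.

```python
import random


def simulate_mean_p(alpha1, alpha2, pi1, pi2, p0=0.5,
                    sequences=100, trials=800, seed=0):
    """Mean probability of R after `trials` trials, averaged over sequences."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(sequences):
        p = p0
        for _ in range(trials):
            chose_r = rng.random() < p
            rewarded = rng.random() < (pi1 if chose_r else pi2)
            alpha = alpha1 if rewarded else alpha2
            # An R response moves p toward 1; an L response moves p toward 0.
            p = alpha * p + ((1.0 - alpha) if chose_r else 0.0)
        total += p
    return total / sequences


mean_p = simulate_mean_p(0.90, 0.95, 0.75, 0.25)
```

With only 100 sequences the estimate carries sampling error of a few hundredths, so a single run should be expected to land near, not exactly on, the tabled values.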

T-maze  Experiment  with  Paradise  Fish 

In the first section of this paper, a T-maze experiment by Bush and
Wilson [2] using paradise fish was described. The rate of reward was 0.75
for response R and 0.25 for response L. In the notation of our model, $\pi_1 = 0.75$
and $\pi_2 = 0.25$. The learning-rate parameters were estimated to be $\alpha_1 = 0.916$
and $\alpha_2 = 0.942$ for the group in which the fish could see the reward through
a transparent divider when they chose the unrewarded side. The initial
probability for response R (estimated from results on the first 10 of the 140
trials) varied considerably from one fish to another, the average value being
0.496, or nearly 0.50. Bush and Wilson report that the initial distribution
of p approximately followed the symmetrical Beta distribution

(25)  $$y = 3.61[p(1 - p)]^{0.7}.$$

This initial distribution was used to calculate the expected fraction
attracted by R. The relative areas under the curve (25) in the ten intervals

[0, 0.1], [0.1, 0.2], ..., [0.9, 1.0]

were found, the values of $f(p)$ at the midpoints of these intervals were
calculated, and their weighted average was obtained. The result was $f(p) = 0.800$.
In the experiment, Bush and Wilson found 15 of the 22 fish in the
experimental group making nearly all R responses after about 100 trials. This
leads to the estimate 0.68 for the proportion ultimately attracted to the R
response. That result is only about one standard error away from the fitted
value 0.80. That small deviation does not even take any account of the
unreliability of the original estimates of the $\alpha$'s.
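The size of the discrepancy can be checked with a short calculation. The paper does not say how the standard error was obtained; the sketch below assumes a simple binomial standard error for 22 fish evaluated at the fitted proportion 0.80, and the variable names are illustrative.

```python
import math

n_fish = 22
observed = 15 / n_fish   # proportion of fish attracted to R
fitted = 0.80            # value obtained from the differential equation

# Binomial standard error of a proportion, evaluated at the fitted value.
standard_error = math.sqrt(fitted * (1.0 - fitted) / n_fish)
deviation_in_se = (fitted - observed) / standard_error
```

Under this assumption the standard error is about 0.085 and the deviation about 1.4 standard errors; using the observed proportion in the standard error instead gives a somewhat smaller figure.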

Other  Methods 

Several other methods of approximating the function have been explored.
One that was rather successful employed the function $f(p; \alpha, 0)$ or
$f(1 - p; 0, \alpha)$, choosing a value of $\alpha$ that made the iterate change very
little. This method was superior to an iteration technique beginning with
$f_0(p) = p$.

Since one knows exactly the solution to the functional equation in the
special case $\alpha_1 = \alpha_2$, the notion of expanding $f(p; \alpha_1, \alpha_2)$ as a power series
in $\alpha_2$ in the neighborhood of $\alpha_1$ suggests itself. Robert R. Bush, in an
unpublished note, developed such a technique.

A THEORY OF DISCRIMINATION LEARNING ¹

FRANK RESTLE

Stanford University ²

This paper presents a theory of
two-choice discrimination learning.
Though similar in form to earlier
theories of simple learning by Estes
(5) and Bush and Mosteller (2,3),
this system introduces a powerful
new assumption which makes definite
quantitative predictions easier to
obtain and test. Several such
predictions dealing with learning and
transfer are derived from the theory and
tested against empirical data.

The  stimulus  situation  facing  a  sub- 
ject in  a  trial  of  discrimination  learn- 
ing is  thought  of  as  a  set  of  cues.  A 
subset  of  these  cues  may  correspond 
to  any  thing — concrete  or  abstract, 
present,  past,  or  future,  of  any  de- 
scription— to  which  the  subject  can 
learn  to  make  a  differential  response. 
In  this  definition  it  does  not  matter 
whether  the  subject  actually  makes  a 
differential  response  to  the  set  of 
cues  as  long  as  he  has  the  capacity  to 
learn  one.  An  individual  cue  is 
thought  of  as  "indivisible"  in  the 
sense  that  different  responses  cannot 
be  learned  to  different  parts  of  it. 
Informally,  the  term  "cue"  will  occa- 
sionally be  used  to  refer  to  any  set  of 
cues,  all  of  which  are  manipulated  in 
the  same  way  during  a  whole  experi- 
ment. 

¹ This paper is adapted from part of a
Ph.D. dissertation submitted to Stanford
University. The author is especially
indebted to Dr. Douglas H. Lawrence and to
Dr. Patrick Suppes for encouragement and
criticism. Thanks are also due Dr. W. K.
Estes who loaned prepublication manuscripts
and Dr. R. R. Bush who pointed out some
relations between the present theory and the
Bush-Mosteller model (3).

² Now at the Human Resources Research
Office, The George Washington University.

In  problems  to  be  analyzed  by  this 
theory,  every  individual  cue  is  either 
"relevant"  or  "irrelevant."  A  cue  is 
relevant  if  it  can  be  used  by  the  sub- 
ject to  predict  where  or  how  reward  is 
to  be  obtained.  For  example,  if  food 
is  always  found  behind  a  black  card 
in  a  rat  experiment,  then  cues 
aroused  by  the  black  card  are  rele- 
vant. A  cue  aroused  by  an  object 
uncorrelated  with  reward  is  "irrele- 
vant." For  example,  if  the  reward 
is  always  behind  the  black  card  but 
the  black  card  is  randomly  moved 
from  left  to  right,  then  "position" 
cues  are  irrelevant.  These  concepts 
are  discussed  by  Lawrence  (6). 

In  experiments  to  be  considered, 
the  subject  has  just  two  choice  re- 
sponses. No  other  activities  are  con- 
sidered in  testing  the  theory.  Any 
consistent  method  of  describing  these 
two  responses  which  can  be  applied 
throughout  a  complete  experiment  is 
acceptable  in  using  this  theory. 

Theory 

In  solving  a  two-choice  discrimina- 
tion problem  the  subject  learns  to 
relate  his  responses  correctly  to  the 
relevant  cues.  At  the  same  time  his 
responses  become  independent  of  the 
irrelevant  cues.  These  two  aspects 
of  discrimination  learning  are  repre- 
sented by  two  hypothesized  processes, 
"conditioning"  and  "adaptation." 

Intuitively, a conditioned cue is one
which the subject knows how to use
in getting reward. If k is a relevant
cue and c(k,n) is the probability that
k has been conditioned at the
beginning of the nth trial, then

$$c(k,n+1) = c(k,n) + \theta[1 - c(k,n)]$$     [1]

is the probability that it will be
conditioned by the beginning of the next
trial. On each trial of a given
problem a constant proportion, $\theta$, of
unconditioned relevant cues becomes
conditioned.

This article appeared in Psychol. Rev., 1955, 62, 11-19. Reprinted with permission.

To the extent that a conditioned
cue affects performance, it contributes
to a correct response only, whereas
an unconditioned relevant cue
contributes equally to a correct and to
an incorrect response.

Intuitively, an adapted cue is one
which the subject does not consider
in deciding upon his choice response.
If a cue is thought of as a "possible
solution" to the problem, an adapted
cue is a possible solution which the
subject rejects or ignores. If a(k,n)
is the probability that irrelevant cue
k has been adapted at the beginning
of the nth trial, then

$$a(k,n+1) = a(k,n) + \theta[1 - a(k,n)]$$     [2]

is the probability that it will be
adapted by the beginning of the
next trial. On each trial of a given
problem a constant proportion of
unadapted irrelevant cues becomes
adapted. An adapted cue is
nonfunctional in the sense that it
contributes neither to a correct nor to
an incorrect response.

It will be noticed that the same
constant θ appears in both equations
1 and 2. The fundamental simplifying
assumption of this theory deals
with θ. This assumption is that

$$\theta = \frac{r}{r + i}$$     [3]

where r is the number of relevant
cues in the problem and i is the number
of irrelevant cues. Thus, θ is the
proportion of relevant cues in the
problem. This proportion is the same
as the fraction of unconditioned cues
conditioned on each trial, and the
fraction of unadapted cues adapted
on each trial.
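
Equations 1 to 3 can be iterated directly. The following Python sketch is an illustration only and is not part of the original article; the function names (`theta_from_cues`, `next_probability`, `trajectory`) and the cue counts in the example are assumptions introduced here.

```python
def theta_from_cues(r, i):
    """Equation 3: proportion of relevant cues, given r relevant and i irrelevant cues."""
    return r / (r + i)


def next_probability(prob, theta):
    """One application of equation 1 (conditioning) or equation 2 (adaptation)."""
    return prob + theta * (1.0 - prob)


def trajectory(theta, n_trials, start=0.0):
    """Probabilities at the beginning of trials 1..n_trials, starting from `start`."""
    probs = [start]
    for _ in range(n_trials - 1):
        probs.append(next_probability(probs[-1], theta))
    return probs


if __name__ == "__main__":
    # Hypothetical problem with 1 relevant and 3 irrelevant cues.
    theta = theta_from_cues(r=1, i=3)
    print(trajectory(theta, 4))
```

Because the same constant enters both equations, a single trajectory serves for the conditioning of a relevant cue and for the adaptation of an irrelevant cue.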

The  performance  function  p(n), 
representing  the  probability  of  a  cor- 
rect response  on  the  nth  trial,  is  in 
accord  with  the  definitions  of  condi- 
tioning and  adapting  given  above. 
The  function  is  in  the  form  of  a  ratio, 
with  the  total  number  of  unadapted 
cues  in  the  denominator  and  the  num- 
ber of  conditioned  cues  plus  one-half 
times  the  number  of  other  cues  in  the 
numerator.  Thus  conditioned  cues 
contribute  their  whole  effect  toward  a 
correct  response,  adapted  cues  con- 
tribute nothing  toward  either  re- 
sponse, and  other  cues  contribute  their 
effect equally toward correct and incorrect
responses. Formally,

$$p(n) = \frac{\sum_{r} c(k,n) + \tfrac{1}{2}\sum_{r}\,[1 - c(k,n)] + \tfrac{1}{2}\sum_{i}\,[1 - a(k,n)]}{r + \sum_{i}\,[1 - a(k,n)]}$$     [4]

Here $\sum_{r}$ is the sum taken over the r
relevant cues and $\sum_{i}$ is the sum taken
over the i irrelevant cues.
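
The verbal description of equation 4 translates into a few lines of Python. This is a sketch under stated assumptions, not the author's procedure: the function name `p_correct` and the representation of cues as two lists of probabilities are choices made here for illustration.

```python
def p_correct(c, a):
    """Equation 4.

    c: conditioning probabilities c(k, n), one entry per relevant cue.
    a: adaptation probabilities a(k, n), one entry per irrelevant cue.
    """
    r = len(c)
    unadapted = sum(1.0 - a_k for a_k in a)          # expected number of unadapted irrelevant cues
    numerator = (
        sum(c)                                       # conditioned cues count fully toward a correct response
        + 0.5 * sum(1.0 - c_k for c_k in c)          # unconditioned relevant cues count one-half
        + 0.5 * unadapted                            # unadapted irrelevant cues count one-half
    )
    return numerator / (r + unadapted)               # adapted cues drop out of the denominator


if __name__ == "__main__":
    # Naive subject, hypothetical problem with 2 relevant and 3 irrelevant cues.
    print(p_correct([0.0, 0.0], [0.0, 0.0, 0.0]))
```

With every cue unconditioned and unadapted the function returns one-half, and with every relevant cue conditioned and every irrelevant cue adapted it returns one, matching the two limiting cases described in the text.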

Some  Consequences  Regarding 
Simple  Learning 

If the subject is naive at the beginning
of training, so that for any relevant
cue k, c(k,1) = 0, and for any
irrelevant cue k, a(k,1) = 0, and if he
receives n trials on a given problem,
then by mathematical induction it
can be shown that if k is relevant,

$$c(k, n+1) = 1 - (1 - \theta)^{n}$$     [5]

and if k is irrelevant,

$$a(k, n+1) = 1 - (1 - \theta)^{n}.$$     [6]

Under these circumstances we can
substitute equations 5 and 6 into
equation 4 and, taking advantage of
the simplifying effects of equation 3,
we have

$$p(n) = 1 - \frac{\tfrac{1}{2}(1 - \theta)^{n-1}}{\theta + (1 - \theta)^{n}}.$$     [7]

Plotting equation 7 shows that p
is an S-shaped function of n with an
asymptote (for θ > 0) at 1.00. Also,
p(1) = ½. Since p(n) is a monotonic
increasing function of θ we can estimate
θ from observations of performance.
If we want to know the
theoretical proportion of relevant cues
in a problem for a particular subject,
we have the subject work on the problem,
record his performance curve,
and solve equation 7 for θ. This
result depends directly upon the simplifying
assumption of equation 3.
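
Substituting equations 5 and 6 into equation 4 and dividing through by r + i gives p(n) = 1 − ½(1 − θ)^(n−1) / [θ + (1 − θ)^n]. The sketch below checks that algebra numerically; it is an illustration only, and the function names and the cue counts used for the check are assumptions introduced here.

```python
def p_closed_form(n, theta):
    """Probability of a correct response on trial n for a naive subject (closed form)."""
    q = (1.0 - theta) ** (n - 1)      # chance that a cue is still unconditioned (or unadapted)
    return 1.0 - 0.5 * q / (theta + (1.0 - theta) * q)


def p_from_cue_counts(n, r, i):
    """The same quantity from equation 4, with equations 5 and 6 substituted."""
    theta = r / (r + i)
    c = a = 1.0 - (1.0 - theta) ** (n - 1)
    numerator = r * c + 0.5 * r * (1.0 - c) + 0.5 * i * (1.0 - a)
    denominator = r + i * (1.0 - a)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical problem with 2 relevant and 6 irrelevant cues (theta = .25).
    for n in (1, 2, 5, 10, 20):
        print(n, round(p_closed_form(n, 0.25), 4), round(p_from_cue_counts(n, 2, 6), 4))
```

The two functions agree trial by trial, and the closed form starts at one-half on the first trial and rises toward one, as the text states.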

Since the instability of individual
learning curves makes it difficult to
fit curves to them, it is fortunate that
θ can be determined in a different way.
Suppose a subject makes E errors in
the course of solving the problem to
a very rigorous criterion and it is
assumed for practical purposes that
he has made all the errors he is going
to make. Theoretically, the total
number of errors made on a problem
can be written

$$E = \sum_{n=1}^{\infty}\,[1 - p(n)].$$

Under the conditions satisfying equation
7, this can be evaluated approximately
by using the continuous time
variable t in place of the discrete trial
variable n, and integrating. The
result of this integration is that

$$E \cong \frac{1}{2} + \frac{1}{2}\,\frac{\log \theta}{(1 - \theta)\,\log (1 - \theta)}.$$     [8]

By equation 8, which relates the total
number of errors made on a problem
to θ, it is possible to make relatively
stable estimates of θ.
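
Equation 8 gives errors as a function of θ, so estimating θ from an observed error score means inverting it numerically. The sketch below reads equation 8 as E ≈ ½ + ½ log θ / [(1 − θ) log(1 − θ)] and inverts it by bisection. The bisection method, the function names, and the search bounds are assumptions of this illustration; the article does not say how the equation was solved.

```python
import math


def expected_errors(theta):
    """Approximate total errors for a naive subject, reading equation 8 as stated above."""
    return 0.5 + 0.5 * math.log(theta) / ((1.0 - theta) * math.log(1.0 - theta))


def theta_from_errors(errors, lo=1e-6, hi=1.0 - 1e-6, tol=1e-10):
    """Invert equation 8 by bisection; expected errors fall as theta rises."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_errors(mid) > errors:
            lo = mid          # too many predicted errors: theta must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical error score, used only to exercise the function.
    print(theta_from_errors(50.0))
```

A round trip through both functions recovers the starting value of θ, which is a quick check that the inversion is set up in the right direction.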


An  Empirical  Test  of  the  Simple 
Learning  Theory — 
Combination  of  Cues 

Consider three problems, S₁, S₂, and
S₃, all of which involve the same irrelevant
cues. Two of the problems, S₁
and S₂, have entirely separate and
different relevant cues, while in problem
S₃ all the relevant cues of S₁ and S₂
are present and relevant. That is,
r₃ = r₁ + r₂ and i₁ = i₂ = i₃. If we
know θ₁ and θ₂ we can compute θ₃,
since by equation 3

$$\theta_1 = r_1/(r_1 + i)$$

$$\theta_2 = r_2/(r_2 + i)$$

$$\theta_3 = (r_1 + r_2)/(r_1 + r_2 + i).$$

Solving these equations for θ₃ in terms
of θ₁ and θ₂ we get

$$\theta_3 = (\theta_1 + \theta_2 - 2\theta_1\theta_2)/(1 - \theta_1\theta_2).$$     [9]

This theorem answers the following
question: Suppose we know how many
errors are made in learning to use
differential cue X and how many are
made in learning to use cue Y; then how many
errors will be made in learning a problem
in which either X or Y can be
used (if X and Y are entirely discrete)?

Eninger  (4)  has  run  an  experiment 
which  tests  equation  9.  Three  groups 
of  white  rats  were  run  in  a  T  maze 
on  successive  discrimination  problems. 
The  first  group  learned  a  visual  dis- 
crimination, black-white,  the  second 
group  learned  an  auditory  discrimina- 
tion, tone-no-tone,  and  the  third 
group  had  both  cues  available  and 
relevant. 

Since each group was run to a
rigorous criterion, total error scores
are used to estimate θ₁ and θ₂ by equation
8.² The values estimated are

² Total error scores do not appear in
Eninger's original publication and are no
longer known. However, trials-to-criterion
scores were reported. Total error scores were

θ₁ = .020, based on an estimated
average of 98.5 errors made on the
auditory-cue problem, and θ₂ = .029,
based on an estimated average of
64.5 errors on the visual-cue problem.
Putting these two values into equation
9 we get

$$\theta_3 = \frac{.029 + .020 - 2(.020)(.029)}{1 - (.020)(.029)} = .049.$$

This value of θ₃ substituted into
equation  8  leads  to  the  expectation  of 
about  33  total  errors  on  the  combined 
cues  problem.  In  fact,  an  average  of 
26  errors  was  made  by  the  four  sub- 
jects on  this  problem.  The  predic- 
tion is  not  very  accurate.  However, 
only  14  animals  were  employed  in  the 
entire  experiment,  in  groups  of  five, 
five,  and  four.  Individual  differences 
among  animals  within  groups  were 
considerable.  If  account  is  taken  of 
sampling  variability  of  the  two  single- 
cue  groups  and  of  the  combined-cue 
group  of  subjects,  the  prediction  is 
not  significantly  wrong.  Further  ex- 
perimentation is  needed  to  determine 
whether  the  proposed  law  is  tenable. 
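
The combination rule of equation 9 is simple enough to check by machine. The sketch below is an illustration only; the function name `combine_theta` and the cue counts in the consistency check are assumptions introduced here, while the two θ values in the usage example are the rounded estimates quoted above.

```python
def combine_theta(theta1, theta2):
    """Equation 9: theta for a problem whose relevant cues are those of two
    single-cue problems that share the same irrelevant cues."""
    return (theta1 + theta2 - 2.0 * theta1 * theta2) / (1.0 - theta1 * theta2)


if __name__ == "__main__":
    # Consistency check against equation 3 with hypothetical cue counts:
    # r1 = 1, r2 = 2, i = 7, so the combined problem has 3 relevant cues out of 10.
    print(combine_theta(1 / 8, 2 / 9))

    # The rounded estimates for the auditory and visual problems.
    print(combine_theta(0.020, 0.029))
```

With the rounded inputs .020 and .029 the function returns a value a little under .05, close to the figure reported in the text.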

It is easily seen that θ₃ will always
be larger than θ₁ or θ₂ if all three
problems  are  solved.  Learning  will 
always  be  faster  in  the  combined-cues 
problem.  Eninger  (4)  in  his  paper 
points  out  that  this  qualitative  state- 
ment is  a  consequence  of  Spence's 
theory  of  discrimination.  However, 
Spence's  theory  gives  no  quantitative 
law. 

Transfer  of  Training 

In  order  to  apply  this  theory 
to  transfer-of-training  experiments  in 
which  more  than  one  problem  is  used, 
certain  assumptions  are  made.     It  is 

estimated from trials-to-criterion scores by
using  other,  comparable  data  collected  by 
Amsel  (1).  Dr.  Amsel  provided  detailed 
results  in  a  personal  communication. 


assumed  that  if  a  cue  is  conditioned  in 
one  problem  and  appears  immediately 
thereafter  as  a  relevant  cue  in  a  new 
problem,  it  is  still  conditioned.  Like- 
wise, an  adapted  cue  appearing  as  an 
irrelevant  cue  in  a  new  problem  is 
adapted.  However,  if  a  conditioned 
cue  is  made  irrelevant  it  is  obviously 
no  longer  conditioned,  since  it  cannot 
serve  as  a  predictor  of  reward.  Simi- 
larly, it  is  assumed  that  if  an  adapted 
cue  is  made  relevant  in  a  new  problem, 
it  becomes  unadapted  and  available 
for  conditioning. 

According  to  the  present  definition 
of  conditioning,  a  conditioned  cue 
contributes  to  a  correct  response. 
Therefore  the  above  assumptions  will 
not  hold  if  the  relation  between  a  cue 
and  a  reward  is  reversed  in  changing 
the  problem.  This  theory  cannot  be 
used  to  analyze  reversal  learning,  and 
is  applicable  only  in  cases  in  which 
relevant  cues  maintain  an  unchanging 
significance. 

If  two  problems  are  run  under  the 
same  conditions  and  in  the  same  appa- 
ratus, and  differ  only  in  the  degree  of 
difference  between  the  discriminanda 
(as  where  one  problem  is  a  black- 
white  and  the  other  a  dark  gray-light 
gray  discrimination),  it  is  assumed 
that  both  problems  involve  the  same 
cues; but the greater the difference to
be discriminated, the more cues are
relevant and the fewer are irrelevant.

Empirical  Tests  of  the  Transfer- 
of-Training  Theory 

As  Lawrence  (7)  has  pointed  out, 
it  seems  that  a  difficult  discrimination 
is  more  easily  established  if  the  sub- 
jects are  first  trained  on  an  easy  prob- 
lem of  the  same  type  than  if  all 
training  is  given  directly  on  the  diffi- 
cult discrimination.  The  experimen- 
tal evidence  on  this  point  raises  the 
question of predicting transfer performance
from one problem to another,
where the two problems involve
the same stimulus dimension but differ
in difficulty.

Suppose that problems S₁ and S₂
both require a discrimination along
the same stimulus dimension and
differ only in that S₂ is more difficult
than S₁. Let θ₁ be the proportion of
relevant cues in problem S₁ and θ₂ be
the proportion of relevant cues in S₂.
Suppose that the training schedule
involves n trials on problem S₁ followed
by j trials on problem S₂. Then
the probability of a correct response
on trial n + j is

$$p(n+j) = \frac{\theta_2 + \tfrac{1}{2}(1 - \theta_2)^{j-1}\,[\theta_1 - \theta_2 + (1 - \theta_1)^{n}(1 - \theta_1 - \theta_2)]}{\theta_2 + (1 - \theta_2)^{j-1}\,[\theta_1 - \theta_2 + (1 - \theta_1)^{n+1}]}.$$     [10]³

This theorem can be tested against
the results of experiments reported by
Lawrence (7). He trained white rats
in one brightness discrimination and
transferred them to a more difficult
problem for further training. A control
group, which Lawrence called
"HDG," learned the hard test problem
without work on any other
problem. The performance of this
control group is used to estimate θ₂,
the proportion of relevant cues in the
test problem. The value found was
.04.⁴ Since the experimental subjects
first worked on the pretraining problem
without prior experience, their
performance on the first problem
serves to estimate θ₁, the proportion
of relevant cues in the easier pretraining
problem. Lawrence replicated
the experiment, having two experimental
groups, ATG No. 1 and ATG
No. 2, each of which transferred
abruptly from an easy pretraining
problem to the test problem. Group
ATG No. 1 had a very easy problem
for which we estimate θ₁ = .14.
Group ATG No. 2 had a more difficult
problem for which θ₁ = .07.

For group ATG No. 1, θ₁ = .14,
θ₂ = .04, and n = 30 since thirty
trials of pretraining were given. From
this information we can compute
p(n + j) for all j, using equation 10.
The predicted transfer performance is
compared with observed performance
in Table 1. For group ATG No. 2,
θ₁ = .07, θ₂ = .04, and n = 50 since
fifty trials of pretraining were given.
Here also, p(n + j) can be computed.
Prediction is compared with observed
performance in Table 1, from which
it can be seen that the predictions are
relatively accurate, though performance
is higher than predicted.

TABLE 1

Prediction of Easy-to-Hard Transfer in Rats*

                     Proportion of Correct Responses

Trials of          Group ATG 1             Group ATG 2
Transfer
Training       Observed   Predicted    Observed   Predicted

 1-10            .66        .63          .81        .71
11-20            .70        .68          .83        .77
21-30            .74        .72          .81        .81
31-40            .84        .78
41-50            .86        .83

* Data from Lawrence (7).

³ The justification of equation 10 involves
no mathematical difficulties. On the first
trial of transfer we know the probability
that any cue relevant in the second problem
is conditioned, since all cues relevant in the
second problem were relevant in the first.
Similarly, we know the probability that i₁ of
the i₂ irrelevant cues are adapted. The
other i₂ − i₁ cues are unadapted. Equations
1 and 2 can be applied at this point, and all
terms divided by r₁ + i₁ (= r₂ + i₂).

⁴ These estimates were made by the unsatisfactory
method of fitting equation 7 to
group average learning curves. Therefore
the results regarding Lawrence's experiment
are approximate.
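
Equation 10 can be evaluated trial by trial. The sketch below reads it as a ratio whose numerator is θ₂ + ½(1 − θ₂)^(j−1)[θ₁ − θ₂ + (1 − θ₁)^n(1 − θ₁ − θ₂)] and whose denominator is θ₂ + (1 − θ₂)^(j−1)[θ₁ − θ₂ + (1 − θ₁)^(n+1)]. It is an illustration only: the function names and the averaging over ten-trial blocks are assumptions introduced here, and the article does not describe how its block predictions were computed.

```python
def p_transfer(j, n, theta1, theta2):
    """Equation 10: probability correct on trial j of the harder problem,
    after n trials on the easier problem (theta1 > theta2)."""
    q = (1.0 - theta2) ** (j - 1)          # carried over from trials on the second problem
    big_q = (1.0 - theta1) ** n            # left unconditioned/unadapted after pretraining
    numerator = theta2 + 0.5 * q * (theta1 - theta2 + big_q * (1.0 - theta1 - theta2))
    denominator = theta2 + q * (theta1 - theta2 + big_q * (1.0 - theta1))
    return numerator / denominator


def block_mean(first, last, n, theta1, theta2):
    """Mean predicted proportion correct over transfer trials first..last."""
    values = [p_transfer(j, n, theta1, theta2) for j in range(first, last + 1)]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Parameters quoted in the text for group ATG No. 1.
    for first in (1, 11, 21, 31, 41):
        print(first, first + 9, round(block_mean(first, first + 9, 30, 0.14, 0.04), 3))
```

When θ₁ and θ₂ are equal, the function reduces to the simple learning curve of a naive subject on trial n + j, which is a useful check on the reconstruction of the formula.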

Lawrence also considered the possibility
that a gradual transition from
easy through successively harder problems
would result in rapid mastery of
the difficult problem. He tested this
proposition by giving another group
of subjects a series of three pretest
problems before the final test problem.
The problems in order of ease of learning
were, first, the problem learned by
ATG No. 1 with θ₁ = .14, an intermediate
problem which was not otherwise
used, the difficult pretest problem
with θ₃ = .07, and finally the test
problem with θ₄ = .04.

To estimate θ₂ in Lawrence's experiment,
where problem S₂ never was
used separately in simple learning,
we notice the relation of θ to differences
between discriminanda in apparent
foot-candles for problems S₁, S₃,
and S₄, whose θ values are known.
We know that if the problems are
properly controlled, and the stimulus
difference is zero foot-candles, there
are no relevant cues and θ is zero. It
was found that this assumption, along
with available data, made it possible
to write a tentative empirical function
relating θ to the difference between
discriminanda in foot-candles. This
equation presumably holds only in the
case of Lawrence's apparatus, training
procedure, subjects, etc. The
equation adopted is

$$\theta = .0988 \log_{10}(.4d)$$     [11]

where d is the difference between discriminanda
in foot-candles. It is emphasized
that this equation has no
theoretical significance and is merely
expedient. From equation 11 it is
possible to determine the θ value of
the intermediate pretraining problem
by interpolation. Table 2 gives the
data and results of this interpolation.
Ten trials were given on each of the
first three problems and fifty trials
on the final test problem. Using the
θ values in Table 2 it is possible to
predict the test problem performance
of subjects who have gone through
gradual transition pretraining.⁵ This
prediction is compared with observed
performance in Table 3. It may be
noted that the correspondence between
prediction and observation is
in this case very close. Again, however,
the prediction is consistently a
little lower than observed performance.

TABLE 2

The Relation of "Difference Between Stimuli" and θ Value of Problem*

Difference Between Discriminanda      Corresponding θ
in Apparent Foot-Candles              Value of Problem

          67.7                             .14
          35.2                             .113**
          14.0                             .07
           5.9                             .04
           0.0                             .00†

* Data from Lawrence (7).
** Estimated by interpolation from empirical equation 11.
† Theoretical; see text for explanation.

TABLE 3

Prediction of Transfer Performance of Rats After a Series of Pretraining Problems*

Trials Working on       Proportion of Correct Responses
Final Test Problem
                          Observed        Predicted

      1-10                  .73              .73
     11-20                  .82              .79
     21-30                  .87              .84
     31-40                  .89              .87
     41-50                  .90              .90

* Data from Lawrence (7).

⁵ The general prediction for transfer through
a series of problems which get successively
more difficult can be derived by following
through and repeating the reasoning in footnote
3. Since the resulting equations are
extremely large and can be derived rather
easily, they are not given here.
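
The empirical relation of equation 11, read here as θ = .0988 log₁₀(.4d), can be checked against the entries of Table 2. The sketch is an illustration only; the function name `theta_from_footcandles` is an assumption introduced here.

```python
import math


def theta_from_footcandles(d):
    """Equation 11: empirical theta for a stimulus difference of d apparent foot-candles.

    The logarithm is undefined at d = 0, so the zero entry in Table 2 is the
    theoretical value discussed in the text, not an output of this function.
    """
    return 0.0988 * math.log10(0.4 * d)


if __name__ == "__main__":
    for d in (67.7, 35.2, 14.0, 5.9):
        print(d, round(theta_from_footcandles(d), 3))
```

Rounded to the precision used in Table 2, the function reproduces the tabled θ values for the four nonzero stimulus differences, including the interpolated value for the intermediate problem.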
New Data

The  theory  has  thus  far  been  tested 
against  the  behavior  of  rats.  Its 
generality is now tested with college
students  in  a  simple  discrimination 
learning  task. 

Subjects and procedure. The subjects in
this experiment were 23 students in the elementary
psychology course at Stanford University.
The S was seated at one end of a
table and told that his responses could be
either "A" or "B". On each trial S saw a
single stimulus, which was a black square on
a circular white background. The two
squares used on alternate trials differed in
size. In problem S₁ the squares differed in
height by \ in., in problem S₂ they differed
by \ in. The mean height of each pair of
squares was 3 in. The squares were viewed
at a distance of about 6 ft.

For half the Ss in each experimental group,
the problem was to say "A" to the smaller
square and "B" to the larger one. The other
Ss had the converse problem. The S was
never told that the problem was a size discrimination.
Stimuli were alternated randomly.
A rest period was called after each
ten trials and S was asked what he thought
the correct solution to the problem was, and
to outline possible solutions which had occurred
to him. This method of questioning
is a modification of Prentice's method (8).

Twelve Ss were trained first on problem S₁
to a criterion of 15 successive correct responses
and then transferred to problem S₂ and run to
the same criterion. These Ss made up the
"Easy-Hard Transfer Group" called EH.
The other 11 Ss were trained first on S₂ and
then transferred to S₁. This was the "Hard-Easy
Transfer Group" called HE. The two
groups were approximately equated for age,
sex, and known special visual skills.

Results. Using the pretraining performance
of the EH group, the average
proportion of relevant cues, θ₁,
was estimated at .254 by equation 8.
Using the pretraining performance of
the HE group, the average proportion
of relevant cues in problem S₂ was
estimated at θ₂ = .138.

The  transfer  performance  of  group 
EH,  which  first  learned  the  easy  and 
then  the  hard  problem,  is  predictable 
by  equation  10.     Since  these  subjects 


worked to a high criterion in pretraining,
we can assume that p(n) is
negligibly different from one at the
end of pretraining. Then by equation
7 we see that (1 − θ₁)^(n−1) is small,
and equation 10 simplifies to

$$p(n+j) = \frac{\theta_2 + \tfrac{1}{2}(1 - \theta_2)^{j-1}(\theta_1 - \theta_2)}{\theta_2 + (1 - \theta_2)^{j-1}(\theta_1 - \theta_2)}.$$     [12]


This  theoretical  function  of  j  is  com- 
pared with  observed  transfer  per- 
formance in  Table  4.  It  is  seen  that 
the  correspondence  is  quite  close  with 
a  negligible  constant  error. 
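
Setting (1 − θ₁)^n to zero in equation 10 leaves a function of j, θ₁, and θ₂ alone. The sketch below evaluates that limiting form with the two estimates quoted above. It is an illustration only; the function name and the averaging over five-trial blocks are assumptions introduced here, so the block means need not coincide exactly with the predicted column of Table 4.

```python
def p_transfer_after_mastery(j, theta1, theta2):
    """Limiting form of equation 10 when pretraining on the easier problem is complete."""
    q = (1.0 - theta2) ** (j - 1)
    return (theta2 + 0.5 * q * (theta1 - theta2)) / (theta2 + q * (theta1 - theta2))


if __name__ == "__main__":
    theta1, theta2 = 0.254, 0.138      # estimates reported for problems S1 and S2
    for first in range(1, 35, 5):
        block = [p_transfer_after_mastery(j, theta1, theta2) for j in range(first, first + 5)]
        print(first, first + 4, round(sum(block) / len(block), 3))
```

On the first transfer trial only the cues that were relevant in the easy problem and are now irrelevant contribute to errors, so the predicted proportion correct begins well above one-half and then climbs toward one.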

This  prediction  is  based  on  the 
formula  which  also  predicted  Law- 
rence's rat  data.  This  confirmation 
suggests  that  the  law  can  be  applied 
to  human  as  well  as  rat  performance 
on  this  type  of  task. 

TABLE 4

Prediction of Transfer of Training from Easier to Harder Problem in Human Subjects

Trials After            Proportion of Correct Responses
Transfer to
Second Problem            Observed        Predicted

      1-5                   .817             .821
      6-10                  .933             .895
     11-15                  .926             .941
     16-20                  .933             .966
     21-25                  .966             .988
     26-30                  .983             .994
     31-35                 1.000            1.000

TABLE 5

Prediction of Transfer of Training from Harder to Easier Problem in Human Subjects

Trials After            Proportion of Correct Responses
Transfer to
Second Problem            Observed        Predicted

      1-4                   .932             .883
      5-8                   .955             .960
      9-12                  .955             .984
     13-16                 1.000             .995

Using the line of reasoning which
developed equation 10 we can produce
an equation to predict transfer
performance from hard to easier problems
of the same sort. Certain cues
are relevant in the easy problem
which were irrelevant in the harder
one. These cues cannot be identified
in the hard problem. For performance
to be perfect in the easier problem
all relevant cues must be identified.
Therefore, when the subject
transfers from the hard to the easier
problem we should expect some small
number of errors to be made. On the
assumption that the hard problem
was completely learned in pretraining,
the formula for transfer performance
on the easy problem is

P(n-\-j)  = 


e2-\-(di-e2)(i-diy-' 


[13] 


where θ₁ is the proportion of relevant
cues in the easy problem and θ₂ is
the  proportion  of  relevant  cues  in  the 
harder  problem.  The  proof  of  this 
theorem  is  similar  to  that  of  equation 
12  above,  and  is  not  given  here. 

Equation  13  yields  the  prediction 
for  transfer  performance  of  the  HE 
subjects.  In  Table  5  the  prediction 
is  compared  with  observed  transfer 
performance. 

Despite  the  very  small  frequencies 
predicted  and  observed,  the  predic- 
tion is  quite  accurate.  In  all,  seven 
errors  were  made  by  eleven  subjects, 
whereas  a  total  of  eight  were  expected. 
This  is  an  average  of  .64  errors  per 
subject  observed,  and  .73  predicted. 

Discussion 

The  definition  of  a  "cue"  in  terms 
of  possible  responses  is  selected  be- 
cause the  theoretical  results  do  not 
depend  critically  upon  the  nature  of 
the  stimulating  agent.  While  cues 
are  thought  of  as  stimulus  elements, 


these  elements  need  not  be  of  the 
nature  of  "points  of  color"  or  "ele- 
mentary tones."  If  a  subject  can 
learn  a  consistent  response  to  a  certain 
configuration  despite  changes  in  its 
constituents,  then  the  configuration 
is  by  definition  a  cue  separate  from 
its  constituents.  The  intention  is  to 
accept  any  cue  which  can  be  demon- 
strated to  be  a  possible  basis  for  a 
differential  response. 

The  process  of  conditioning  de- 
scribed in  this  paper  is  formally 
similar  to  the  processes  of  condi- 
tioning of  Estes  (5)  and  Bush  and 
Mosteller  (2,3).  In  the  present 
theory  conditioning  takes  place  at 
each  trial,  not  only  on  "reinforced" 
trials.  In  earlier  theories  condition- 
ing is  said  to  occur  only  on  such  rein- 
forced trials.  In  two-choice  discrimi- 
nation the  incorrect  response  has  a 
high  initial  probability  (one-half)  be- 
cause of  the  nature  of  the  physical 
situation  and  the  way  of  recording 
responses.  Therefore,  a  theory  of 
two-choice  learning  must  account  for 
the  consistent  weakening  of  such  re- 
sponses through  consistent  nonrein- 
forcement. 

The  notion  of  adaptation  used  here 
is  formally  analogous  to  the  operation 
of Bush and Mosteller's Discrimination
Operator "D" (3). However,
whereas  Bush  and  Mosteller's  operator 
is  applied  only  on  trials  in  which  the 
reward  condition  is  reversed  for  a  cue, 
the  present  theory  indicates  that  this 
process  takes  place  each  trial.  In 
addition, while the Discrimination
Operator and the process of adaptation
are both exponential in form,
Bush and Mosteller introduce a new
exponential constant k for this purpose
and the present theory uses the
conditioning constant θ.

The  major  point  differentiating  the 
present  theory  from  similar  earlier 
theories is the use of the strong simplifying
assumption identifying the
exponential constant θ with the proportion
of relevant cues. This assumption
may appear intuitively unlikely,
but if it should be shown by
further  experiment  to  be  tenable,  the 
predictive  power  of  discrimination 
learning  theory  is  enhanced.  There 
seems  to  be  no  reason  for  abandoning 
so  useful  an  assumption  unless  experi- 
mental results  require  it. 

Summary 

A  theory  of  two-choice  discrimina- 
tion learning  has  been  presented. 
The  theory  is  formally  similar  to 
earlier  theories  of  Estes  (5)  and  Bush 
and  Mosteller  (3)  but  differs  some- 
what in  basic  concepts  and  uses  a 
new  simplifying  assumption. 

From  this  theory  three  empirical 
laws are derived: one dealing with the
combination  of  relevant  cues,  and  two 
dealing  with  a  special  type  of  transfer 
of  training.  These  laws  permitted 
quantitative  predictions  of  the  be- 
havior of  four  groups  of  rats  and  two 
groups of human subjects. Five of
these six predictions were quite accurate,
and the sixth was within the
range of reasonable sampling deviation.

THE ROLE OF OBSERVING RESPONSES IN
DISCRIMINATION LEARNING ¹

Part  I 

BY  L.  BENJAMIN  WYCKOFF,  JR. 

University  of  Wisconsin 


Theorists  in  the  area  of  discrimina- 
tion learning  have  often  had  occasion 
to refer to a set or predisposition of S
to learn differential responses to a particular
pair of stimuli. Such a predisposition
has often been attributed
to some reaction of S such as an attending
response, orienting response,
perceiving  response,  sensory  organiza- 
tional activity,  etc.  To  implement 
the  discussion  of  the  role  of  such  re- 
actions in  discrimination  learning  we 
shall  adopt  the  term  "observing  re- 
sponse" (Ro)  to  refer  to  any  response 
which  results  in  exposure  to  the  pair 
of  discriminative  stimuli  involved. 
The  probability  of  occurrence  of  an 
observing  response  will  be  denoted  by 
po.  These  responses  are  to  be  dis- 
tinguished from  the  responses  upon 
which  reinforcement  is  based;  that  is, 
running,  turning  right  or  left,  lever 
pressing,  etc.,  which,  for  convenience, 
we  shall  term  "effective  responses." 

Spence  (19)  has  proposed  a  theory 
of  discrimination  which  is  specifically 
intended  to  deal  with  situations  where 
no  observing  response  is  required  of  S, 
that is to say, to situations in which S
is  certain  to  be  exposed  to  the  dis- 
criminative stimuli  on  each  trial  or 
prior  to  each  effective  response  (po  = 
1).  The  fact  that  in  some  discrimina- 
tion experiments  this  condition  has 
not  been  satisfied  has  become  an  issue 

¹ This paper is submitted in partial fulfill-
ment of  the  requirements  for  the  degree  of 
Doctor  of  Philosophy,  in  the  Department  of 
Psychology,  Indiana  University.  The  writer 
wishes  to  express  his  appreciation  to  Dr.  C.  J. 
Burke  for  his  invaluable  guidance  and  stimu- 
lation. 


in  the  literature,  largely  because  it 
became  necessary  to  delimit  clearly 
the  situations  to  which  Spence's 
theory  is  intended  to  apply. 

Spence's  theory  of  discrimination  states 
that  stimulus-response  connections  are 
strengthened  or  weakened  during  discrimi- 
nation training  in  essentially  the  same  way 
as  these  changes  would  occur  during  condi- 
tioning or  extinction.  When  a  response  is 
reinforced  the  connections  between  it  and 
all  aspects  of  the  stimulus  situation  im- 
pinging on  S  at  the  time  the  response  oc- 
curred will  be  strengthened.  These  connec- 
tions will  be  weakened  when  the  response 
is  not  reinforced.  Certain  implications  of 
this  theory  were  questioned  by  Krechevsky 
(11)  and  other  theorists,  and  became  the 
subject  matter  of  the  "continuity-disconti- 
nuity" controversy.  This  material  has  been 
reviewed  a  number  of  times  (2,  5)  and 
need  not  be  repeated  in  detail  here.  One 
aspect of the controversy is pertinent to
the  present  discussion.  Krechevsky  (12) 
presented  experimental  findings  which  indi- 
cated that  rats  learned  nothing  with  respect 
to  two  stimulus  patterns  during  the  first  20 
trials  of  a  discrimination  experiment  even 
though  they  were  systematically  reinforced 
for approaching a particular pattern during
this interval. Failure to learn was established
by showing a lack of interference
when Ss were tested on a reversed discrimination.
These findings were in apparent
disagreement with the data obtained by
McCulloch and Pratt (13) in a similar experiment
in which differing weights were
used  as  discriminative  stimuli.  Here  in- 
terference was  obtained,  indicating  that 
some  cumulative  learning  had  occurred  in 
the  early  portion  of  the  experiment. 

In interpreting these results, Spence (20,
p. 277) argued that the stimuli (patterns)
used by Krechevsky were not sufficiently
conspicuous to provide a legitimate test of
his theory. He suggested that Ss had not
learned to orient toward the stimuli within
the first 20 trials. He points out that in
such  cases,  "...  the  animal  must  learn  to 
orient  and  fixate  its  head  and  eyes  so  as  to 
receive  the  critical  stimuli."  He  then  sug- 
gests a  way  in  which  this  learning  may 
occur.  "These  reactions  are  learned  .  .  . 
because  they  are  followed  within  a  short 
temporal  interval  by  the  final  goal  re- 
sponse." 

This interpretation was put to an experimental
test by Ehrenfreund (5). In his
experiment the likelihood of S's receiving
the critical stimuli was manipulated by
changing  the  position  of  the  stimuli  (up- 
right and  inverted  triangles)  with  respect 
to  the  landing  platform  of  a  jumping  stand. 
The  design  of  the  experiment  was  essen- 
tially the  same  as  Krechevsky's.  The  re- 
sults conform  to  Spence's  interpretation. 
When  the  stimuli  were  placed  relatively 
high,  no  learning  occurred  within  the  first 
40  trials,  whereas  when  the  stimuli  were 
placed  closer  to  the  landing  platform  learn- 
ing did  occur.  Learning  was  again  meas- 
ured in  terms  of  interference  in  the  learn- 
ing of  a  subsequent  reversed  discrimination. 

The analysis of discrimination situations
in which some observing response
is required is of interest for
several  reasons.  First,  discrimination 
learning  in  situations  other  than  labor- 
atory experiments,  such  as  human 
learning  in  the  course  of  every  day 
events,  is  largely  of  this  kind.  Sec- 
ondly, even  in  the  most  closely  con- 
trolled laboratory  experiments  it  is 
seldom,  if  ever,  possible  to  say  with 
certainty  that  5  is  exposed  to  the  dis- 
criminative stimuli  prior  to  each  effec- 
tive response.  In  the  case  of  pattern 
discriminations  it  has  been  demon- 
strated by  Ehrenfreund  (5)  that  rela- 
tively small  dififerences  in  the  position 
of  the  discriminative  stimuli  will 
effect  discrimination  learning,  indicat- 
ing that  relatively  precise  fixation  of 
the  stimulus  is  required. 

In the present paper an attempt
will be made to develop a more extensive
theory of discrimination which
will include situations in which some
observing response (hereafter referred
to as Ro) is required before S is exposed
to the discriminative stimuli.
An example of such a situation would
be an experiment in which stimulus
cards were placed overhead. In this
case the response of raising the head
would be the Ro.

If  we  accept  the  notion  that  changes 
in  po  can  be  accounted  for  within  the 
framework  of  reinforcement  learning 
theory,  it  should  be  possible  to  devise 
a  theory  of  discrimination  which  will 
include  those  cases  where  some  Ro  is 
necessary.  The  purpose  of  this  paper 
is  to  outline  such  a  theory.  We  shall 
see  that  by  analyzing  discrimination 
learning  in  this  way  it  will  be  possible 
to  account  for  stimulus  generalization 
and  also  changes  in  generalization  dur- 
ing discrimination  learning  without 
postulating  any  direct  interaction  be- 
tween stimuli.  Several  hypotheses 
will  be  derived  from  this  theory  which 
have  been  tested  in  an  experiment  by 
the  author  presented  in  detail  else- 
where (22).  Finally  we  shall  outline 
a  way  in  which  the  present  theory  can 
be  integrated  with  existing  quantita- 
tive theories  of  conditioning  and  ex- 
tinction to  form  a  quantitative  theory 
of  discrimination. 

To  simplify  this  discussion  let  us 
consider  a  hypothetical  experiment 
using  a  situation  similar  to  that  used 
by  Wilcoxon,  Hays,  and  Hull  (21), 
and  later  used  by  Hull  (10)  for  a  dis- 
crimination experiment.  In  this  ex- 
periment a  rat  was  placed  in  a  small 
compartment  with  a  single  exit 
through  a  door  into  a  goal  compart- 
ment. A  measure  of  the  latency  of 
the  response  of  running  through  this 
door  was  obtained.  The  discrimina- 
tive stimuli  consisted  of  a  black  or  a 
white door, either one of which was
present on each trial. During discrimination
training the running response
was reinforced with food when
one  color  was  present,  whereas  rein- 
forcement was  withheld  when  the 
other  color  was  present.  Each  stimu- 
lus was  present  on  an  average  of  50 
per  cent  of  the  trials. 

For  purposes  of  the  present  dis- 
cussion let  us  consider  a  slightly  differ- 
ent situation  in  which  the  discrimina- 
tive stimuli  are  placed  overhead 
rather  than  directly  in  front  of  S.  In 
this  case  an  observing  response, 
raising  the  head,  will  be  necessary  if 
S is to be exposed to the discriminative
stimuli. On each trial, when S is
placed  in  the  apparatus,  there  will  be 
a  certain  probability  that  the  Ro  of 
looking  up  will  occur.  When  Ro  does 
occur  S  will  be  exposed  either  to  a 
black  or  a  white  card.  When  the  Ro 
fails  to  occur  S  will  not  be  exposed  to 
either  card,  but  rather  to  a  neutral 
population  of  stimuli  (walls,  floor, 
etc.).  Note  that  in  this  situation  S 
does  not  improve  its  chances  of  ulti- 
mate reinforcement  by  making  the  Ro. 
The  food  is  placed  in  the  goal  com- 
partment whenever  the  white  card  is 
present  whether  S  actually  looks  up 
or  not.  In  a  sense  then,  S  gains  only 
information  by  making  the  Ro. 

We  are  now  in  a  position  to  examine 
the  relation  between  observing  re- 
sponses and  stimulus  generalization. 
In  general  it  is  apparent  that  if  po  has 
a  low  value,  S  will  seldom  be  exposed 
to  the  discriminative  stimuli  (the 
black and white cards). S, therefore,
will  have  minimum  opportunity  to 
learn  discrimination  or  to  manifest 
any  discrimination  already  learned. 
On  the  other  hand,  if  po  has  a  high 
value,  the  opportunity  to  learn  or 
manifest  discrimination  will  be  large. 

Stimulus generalization between
two stimuli is usually defined either
in terms of S's tendency to respond
similarly to the two stimuli, or in
terms of failure to learn differential
responses readily. Thus we can see
that stimulus generalization will decrease
as po increases.

If  we  assume  that  po  changes  as  a 
result  of  learning  processes  we  can  see 
that  these  changes  would  give  rise  to 
changes  in  generalization  between  the 
stimuli  involved.  More  specifically, 
if  we  assume  that  po  will  increase  dur- 
ing discrimination  learning  (differen- 
tial reinforcement),  generalization  be- 
tween the  discriminative  stimuli  will 
decrease.  Similarly,  we  might  as- 
sume that  po  will  decrease  if  we  intro- 
duce a  procedure  in  which  the  subject 
is  reinforced  equally  often  in  the  pres- 
ence of  either  stimulus  (non-differ- 
ential reinforcement).  This  decrease 
in  po  would  give  rise  to  an  increase  in 
generalization  between  the  stimuli. 

In  the  case  of  the  hypothetical  ex- 
periment suggested  above,  generaliza- 
tion will  be  shown  in  a  "crossover" 
effect  between  positive  and  negative 
trials.  Reinforcements  on  positive 
trials  (positive  stimulus  card  present 
but  not  necessarily  observed)  will 
tend  to  strengthen  the  effective  re- 
sponse on  negative  trials,  while  unrein- 
forced  responses  on  negative  trials  will 
tend  to  weaken  the  effective  response 
on positive trials. If S's tendency to
look  up  increases  during  differential 
reinforcement,  this  "crossover"  effect 
will  decrease.  If  during  non-differ- 
ential reinforcement  the  tendency  to 
look  up  decreases,  the  "crossover" 
effect  will  increase. 

It should be emphasized that these
statements regarding increases and
decreases in po are, at this point, assumptions
which may or may not be
true in a particular experimental situation.
Below we shall present experimental
findings which suggest that these
assumptions are quite generally true.

In the above discussion we have
considered the effects of Ro on discrimination
and generalization. At
this  point  we  turn  our  attention  to  the 
problem  of  accounting  for  changes  in 
po  within  the  framework  of  reinforce- 
ment learning  theory.  Our  problem 
will  be  to  identify  possible  reinforcing 
conditions  which  may  account  for  in- 
creases in  po  during  differential  rein- 
forcement. 

First  we  note  that,  by  definition, 
the  observing  response  results  in  ex- 
posure to  a  pair  of  discriminative 
stimuli.  If  exposure  to  these  stimuli 
is  in  some  way  reinforcing,  we  shall 
expect  po  to  increase  or  remain  high. 
The  problem  at  hand  is  to  show  how 
exposure  to  discriminative  stimuli 
may  have  a  reinforcing  effect  under 
the  condition  of  differential  reinforce- 
ment, while  the  same  stimuli  do  not 
have  this  effect  under  the  condition  of 
non-differential  reinforcement.  Re- 
inforcement theory  provides  two  ways 
of  accounting  for  this  reinforcing 
effect. 

The  first  method  is  the  mechanism 
suggested  by  Spence  when  he  states 
that  observing  responses  are  learned 
"because  they  are  followed  within  a 
short  temporal  interval  by  the  final 
goal  response"  (19).  This  mechanism 
will  operate  in  experiments  such  as  a 
"jumping  stand"  experiment,  in 
which  exposure  to  discriminative  stim- 
uli may  serve  to  increase  the  prob- 
ability of  prompt  reinforcement,  that 
is  to  say,  the  probability  of  the  "cor- 
rect" jump  may  be  increased.  Spence 
offered  this  suggestion  in  relation  to  a 
jumping  stand  experiment. 

The  second  method  of  accounting 
for  the  reinforcing  effect  is  by  appeal 
to  the  principles  of  secondary  rein- 
forcement. Here  we  suggest  that  the 
discriminative  stimuli  themselves  take 
on  secondary  reinforcing  value  during 
the  course  of  discrimination  learning. 


It  has  been  demonstrated  that  an  origi- 
nally neutral  stimulus  which  accompanies 
reinforcement  may  acquire  secondary  re- 
inforcing properties.  That  is,  it  may  serve 
to  strengthen  a  response  upon  which  it  is 
made  contingent.  Skinner  (18,  p.  246) 
has  demonstrated  that  whenever  a  stimulus 
becomes  a  discriminative  stimulus  for  some 
response  in  a  chain  leading  ultimately  to 
reinforcement,  this  stimulus  will  serve  as  a 
secondary  reinforcing  stimulus.  The  con- 
ditions necessary  for  the  formation  of  sec- 
ondary reinforcing  properties  are  further 
considered  by  Notterman  (16),  Schoenfeld 
et  al.  (17)  and  Dinsmoor  (4).  They  point 
out that in all cases where secondary reinforcement
has been demonstrated, the conditions
were also appropriate for the establishment
of the stimulus in question as a
discriminative  stimulus.  They  suggest  that 
this  may  be  a  necessary  (as  well  as  suffi- 
cient) condition  for  the  establishment  of 
secondary  reinforcing  properties.  In  the 
present  formulation  it  is  apparent  that  the 
positive  stimulus  is  presented  in  the  ap- 
propriate temporal  position  to  become  both 
a  discriminative  stimulus  (for  the  effec- 
tive response)  and  a  secondary  reinforcing 
stimulus  (for  the  observing  response). 

This  mechanism  may  operate  in 
any  situation  whatever  where  an  Ro 
is  involved,  since  it  is  a  defining  char- 
acteristic of  the  Ro  that  it  leads  to  ex- 
posure to  discriminative  stimuli. 
Specifically  it  should  apply  to  the 
hypothetical  experiment  suggested 
above.  Here  the  effective  response 
(running)  will  always  be  reinforced 
when  S  is  exposed  to  the  white  card. 
Hence  the  white  card  could  be  ex- 
pected to  acquire  secondary  reinforc- 
ing value.  It  is  not  sufficient  to  show 
simply  that  the  positive  stimulus  will 
acquire  secondary  reinforcing  value. 
We  must  also  consider  two  other 
factors.  First,  Ro  results  in  exposure 
to  the  positive  stimulus  only  50  per 
cent  of  the  time.  It  results  in  ex- 
posure to  the  negative  stimulus  the 
other 50 per cent. Second, the running
response is reinforced sometimes
when S is exposed to the neutral
stimulus population, since, on positive
trials, the running response is reinforced
even though S does not look
up. The effective response is reinforced
most consistently when S is
exposed to the positive stimulus.
Therefore, it is still plausible to
postulate that the intermittent exposure
to the positive and negative
stimuli will have a net reinforcing
effect on Ro.

It is true of both of these mechanisms
that, before any increase in po
can be expected to occur, S must learn
differential effective responses, that
is to say, S must learn to respond
differently to the two discriminative
stimuli. In the case of the "jumping
stand" experiment, if S does not have
differential jumping tendencies toward
the discriminative stimuli, the
probability of reinforcement will always
be 50 per cent, and will not be
improved by the occurrence of Ro.

When  we  apply  the  secondary  rein- 
forcement principle  we  can  see  that 
the  positive  stimulus  must  appear  in 
the  proper  temporal  relation  to  rein- 
forcement a  number  of  times  before 
this  stimulus  will  acquire  secondary 
reinforcing  properties.  In  terms  of 
Notterman, Schoenfeld, and Dinsmoor's
interpretation it will be necessary
for S to learn differential effective
responses to the discriminative
stimuli  before  secondary  reinforcing 
properties  are  acquired  by  these 
stimuli. 

In  view  of  these  considerations  we 
introduce  the  following  general  hy- 
pothesis: Exposure  to  discriminative 
stimuli  will  have  a  reinforcing  effect 
on the observing response to the extent
that S has learned to respond
differently  to  the  two  discriminative 
stimuli. 

Hereafter we shall refer to the
magnitude of the difference between
S's tendencies to respond to the two
discriminative stimuli as the "degree
of discrimination."

Earlier  it  was  pointed  out  that  the 
probability  of  occurrence  of  Ro  is  one 
of  the  factors  determining  the  rate  of 
formation  of  discrimination.  Accord- 
ing to  the  present  hypothesis  the  op- 
posite relationship  is  also  true.  The 
resulting  picture  is  one  of  a  circular 
interrelationship,  in  which  Ro  affects 
the  formation  of  discrimination  be- 
cause of  its  effect  on  exposure  to  dis- 
criminative stimuli,  while  the  degree 
of  discrimination  affects  Ro  through 
another  mechanism  involving  either 
secondary  reinforcement  or  changes 
in  the  probability  of  reinforcement. 

We  now  present  four  propositions 
which  are  implied  by  this  general 
hypothesis.  The  hypothesis  was 
formulated  partly  on  the  basis  of 
experimental  evidence  already  avail- 
able, which  suggested  that  these 
propositions  were  true  (22).  At  pres- 
ent we  shall  consider  them  as  specific 
hypotheses.  The  first  two  of  these 
have  already  been  introduced  as  as- 
sumptions. 

1.  po  will  increase  (or  remain  high) 
under  conditions  of  differential  rein- 
forcement. 

2.  po  will  decrease  (or  remain  low) 
under  conditions  of  non-differential 
reinforcement. 

It  is  apparent  that  these  hypotheses 
are  consistent  with  the  general  hy- 
pothesis since  the  degree  of  discrimi- 
nation will  tend  to  increase  (or  remain 
high)  under  differential  reinforcement, 
while  it  will  tend  to  decrease  (or 
remain  low)  under  nondifferential 
reinforcement. In other words, S
will learn to respond differently to the
two stimuli under differential reinforcement,
but will learn to respond
in the same way to them under non-differential
reinforcement. Additional
hypotheses of interest can be derived
from this general hypothesis.

3.  When  a  well  established  dis- 
crimination is  reversed  po  will  de- 
crease temporarily  and  then  return  to 
a  high  value. 

We  shall  expect  this  change  in  po 
because,  following  a  reversal,  the  de- 
gree of  discrimination  will  decrease 
as  the  original  discrimination  van- 
ishes. It  will  then  increase  as  the 
new  discrimination  is  formed. 

4.  If  at  some  point  in  an  experiment 
the  degree  of  discrimination  is  low  and 
at  the  same  time  po  is  low  (but  greater 
than zero), we shall expect the formation
of discrimination to be retarded
for  some  interval,  but  finally  to  occur 
quite  rapidly. 

This  hypothesis  arises  from  the  fact 
that  increases  in  the  degree  of  dis- 
crimination, and  increases  in  po,  are 
dependent  upon  each  other.  Early  in 
the process S will be exposed to the
discriminative  stimuli  only  a  small 
proportion  of  the  time  and  hence  the 
degree  of  discrimination  cannot  in- 
crease rapidly.  At  the  same  time  po 
will  not  increase  because  of  the  low 
degree  of  discrimination.  Then,  as 
the  degree  of  discrimination  becomes 
sufficiently  great  to  bring  about  an 
increase  in  po  the  entire  learning  proc- 
ess will  be  accelerated. 

Krechevsky (11) presents data obtained
in discrimination experiments in a jumping
stand situation which correspond in some
respects to the predictions of the present
formulation. Curves for individual Ss
show relatively abrupt discrimination formation.
In general the curves also show
a slight improvement in discrimination
prior to the abrupt change. A curve presented
for discrimination reversal shows a
rapid decrease in the degree of discrimination
to a chance level, followed by an interval
during which improvement was much
less rapid. Finally the process accelerated
as the reversed discrimination formed.
Krechevsky also noted that during the interval
while S was responding approximately
according to chance with respect to
the discriminative stimuli, he showed a
strong position preference. These findings
are in complete agreement with hypotheses
3 and 4 in the present formulation.

The  four  hypotheses  presented  so 
far  were  tested  in  an  experiment  by 
the  writer  (22)  which  is  presented  in 
detail  elsewhere.  In  this  experiment 
direct  measures  of  an  Ro  were  obtained 
during  differential  reinforcement,  non- 
differential  reinforcement  and  dur- 
ing discrimination  reversal.  Pigeons 
were  used  in  a  Skinner-box  situation 
in  which  the  effective  response  was 
striking  a  single  translucent  key. 
The  discriminative  stimuli  were  col- 
ored lights  (red  and  green)  projected 
on  the  back  of  the  key  one  at  a  time. 
The  colored  lights  were  withheld  and 
the  key  was  lighted  white  until  the 
Ro  occurred.  The  Ro  consisted  of 
stepping  on  a  pedal  on  the  floor  of  the 
compartment.  The  reasons  for  using 
this  response  as  an  observing  response 
are  discussed  in  detail  elsewhere  (22). 
Here  it  will  suffice  to  say  that  this  re- 
sponse falls  within  our  definition  of  an 
observing  response  in  that  it  resulted 
in  exposure  to  the  discriminative 
stimuli.  As  in  the  case  of  the  hy- 
pothetical experiment  discussed 
above,  the  observing  response  had  no 
effect  on  the  probability  of  reinforce- 
ment at  any  given  moment. 

All  of  the  above  hypotheses  were 
supported  by  the  results  of  this  ex- 
periment. Concerning  the  first  three 
hypotheses, po was higher under differential
reinforcement than under non-differential
reinforcement. When Ss
were  shifted  from  differential  to  non- 
differential  reinforcement  a  marked 
decrease  in  po  occurred.  All  of  these 
differences  were  significant  at  a  5  per 
cent  level  of  confidence  or  better. 

The fourth hypothesis does not apply
unless at some point in the experiment
the degree of discrimination and
po are both low. This condition was
not satisfied consistently since the
operant (or base) level of the pedal
response turned out to be relatively
high for Ss. However, in several
cases  this  condition  was  satisfied  and 
in  these  cases  the  results  conformed 
to  the  hypothesis. 

We  can  now  illustrate  some  ways 
in  which  this  theory  might  be  useful 
in  interpreting  behavior  in  other  ex- 
periments. 

1.  If  this  theory  is  applied  to  situ- 
ations in  which  more  than  one  pair  of 
discriminative  stimuli  is  involved  we 
can  make  some  predictions  regarding 
changes in the readiness of S to form
discriminations  based  on  some  par- 
ticular pair  of  stimuli. 

2.  It  has  been  demonstrated  that 
when a discrimination is reversed repeatedly
Ss tend to learn the reversed
discrimination  more  and  more  rapidly 
(15,  8).  According  to  the  present 
theory,  during  discrimination  reversal 
the  observing  response  is  partially 
extinguished  and  reconditioned. 
Thus,  during  repeated  reversals,  the 
Ro  is,  in  effect,  reinforced  intermit- 
tently. Studies  of  intermittent  rein- 
forcement have  indicated  that  when  a 
response  is  intermittently  extin- 
guished and  reconditioned,  the 
strength  of  the  response  tends  to  at- 
tain a  relatively  constant  high  value 
(18).  On  the  first  reversal  po  might 
drop  to  a  low  value,  and  recover 
slowly,  but  with  repeated  reversals 
we  would  expect  this  drop  to  become 
less  prominent,  and  finally,  po  would 
remain  high  throughout  the  reversal. 
It  is  apparent  that  if  po  remained 
high,  a  reversed  discrimination  would 
be  learned  more  rapidly  than  other- 
wise. 

In the preceding discussion we have
examined some of the ways in which
discrimination learning may be affected
when some observing response
is  required  of  S.  We  shall  now  derive 
some  quantitative  statements  to  sup- 
plement the  above  analysis.  We 
shall  attempt  to  set  down  the  relation- 
ships involved  in  such  a  way  that  the 
present  theory  can  be  readily  inte- 
grated into  existing  quantitative  the- 
ories of  learning  such  as  Hull's  (9), 
Estes'  (6)  or  Bush  and  Mosteller's 
(3).  The  potential  applications  of 
this  development  could  proceed  along 
two  different  lines. 

First,  we  could  attempt  to  state  the 
relationships  between  observing  re- 
sponses and  measurable  aspects  of  the 
effective  responses  in  such  a  way  that 
po  could  be  estimated  in  situations 
where  direct  measurement  of  Ro  is 
not  feasible.  This  might  be  the  case, 
for  example,  if  the  Ro  involved  focus- 
ing of  the  eye.  If  we  apply  the  pres- 
ent development  in  this  way,  po 
would  become  an  intervening  vari- 
able, which  could  be  used  to  account 
for  and  predict  behavior  in  situations 
where  (1)  the  apparent  generalization 
between  stimuli  changes,  or  (2)  where 
the  ease  of  formation  of  discrimination 
changes  as  a  function  of  training. 
Berlyne  (1)  suggests  that  "attention" 
be  treated  in  a  similar  way. 

Secondly,  we  could  predict  dis- 
crimination learning  functions  by 
adopting  some  set  of  assumptions  re- 
garding the  component  learning  pro- 
cesses involved.  These  assumptions 
could  be  adopted  from  some  existing 
theory  which  treats  the  simpler  proc- 
esses of  conditioning  and  extinction. 
The main obstacle to this endeavor at
the  moment  is  the  absence  of  any 
quantitative  function  for  predicting 
changes  in  po.  However,  we  shall 
be  able  to  set  down  the  relationships 
involved  in  such  a  way  that  any  ac- 
ceptable function  can  immediately  be 
inserted.

Quantitative Analysis

For purposes of this analysis let us
return to consideration of the hypothetical
experiment discussed above.
There it was pointed out that we must
take into consideration three different
stimulus populations which may affect
S's behavior. We shall adopt the following
notation to represent these
stimuli. Let S1 represent the stimulus
population to which S is exposed on
trials when the Ro occurs and when
the positive stimulus card (white) is
present, S2 represent the stimulus
population on trials when the Ro occurs
and when the negative stimulus
card (black) is present, and S3 the
stimulus population to which S is
exposed when the Ro fails to occur.

In  this  analysis  we  shall  use  the 
symbol  p  to  represent  the  probability 
of  occurrence  of  the  effective  response 
at  any  given  moment  during  a  trial. 
This  variable  can  be  related  to  the 
variable  of  response  latency  as  follows. 
Estes  (6)  has  shown  that  if  a  response 
can  be  expected  to  occur  with  a  given 
probability  at  any  moment  during  a 
trial,  the  mean  latency  of  the  response 
will  be  proportional  to  the  reciprocal 
of  the  probability;  that  is  to  say, 
L  =  k/p,  where  L  is  the  mean  latency, 
p  the  probability,  and  k  a  constant  of 
proportionality  which  will  depend  on 
the  units  of  measurement  used.  In 
the  present  case  we  must  consider  the 
probability  of  occurrence  of  the  effect- 
ive response  for  each  of  three  stimulus 
populations. Let us adopt the symbols
p1, p2, and p3 to represent the
probability of occurrence of the effective
response when S is exposed to S1,
S2, and S3, respectively. We shall
also wish to refer to the net probability
of occurrence of the effective response
on a given trial, taking into
account that S may be exposed to
different stimuli during the trial depending
on the occurrence or non-occurrence
of the Ro. We shall use the
symbols p+ and p- to represent the
net probability on trials when the
positive or negative stimuli are present.
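
The reciprocal relation between mean latency and momentary probability can be checked with a short sketch. It assumes, purely for illustration, that a trial is divided into discrete moments of length k and that the effective response occurs independently at each moment with probability p. The function name and the truncation length are hypothetical and are not part of the original analysis.

```python
def mean_latency(p, k=1.0, max_moments=10_000):
    """Mean waiting time until the first response when the response occurs
    with probability p at each discrete moment of length k.

    The infinite sum is truncated at max_moments, so the result is only
    accurate when p is not extremely small.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    total = 0.0
    survive = 1.0  # probability that no response has occurred yet
    for n in range(1, max_moments + 1):
        total += n * k * survive * p  # first response falls on moment n
        survive *= 1.0 - p
    return total


# Compare the computed mean latency with k/p for a few illustrative values.
for p in (0.5, 0.2, 0.1):
    print(p, round(mean_latency(p), 4), round(1.0 / p, 4))
```

Under these assumptions the computed mean agrees with k/p, which is the form L = k/p used in the text.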

To  summarize: 

S1 = the population of stimuli to
which S is exposed if (1) the
positive stimulus is present and
(2) the Ro occurs.

S2 = the population of stimuli to
which S is exposed if (1) the
negative stimulus is present
and (2) the Ro occurs.

S3 = the population of stimuli to
which S is exposed if the observing
response fails to occur.

p = the probability that the effective
response will occur at any
given moment during a trial
(= k/L).

p1 = the value of p when S is exposed
to S1.

p2 = the value of p when S is exposed
to S2.

p3 = the value of p when S is exposed
to S3.

p+ = the net value of p for a trial on
which the positive stimulus is
present.

p- = the net value of p for a trial on
which the negative stimulus is
present.

po = the probability of occurrence of
Ro at any given moment during
a trial.

We shall now express certain functional
relationships among these variables.
First we shall express p+ and
p- as two functions of the variables
p1, p2, p3, and po. p+ and p- are
variables which can be evaluated
from experimental measures, such as
latency of the effective response, without
reference to direct measures of
Ro. They correspond to the measures
of response tendency usually obtained
in discrimination experiments.
However, in the present framework
p+ and p- are assumed to be the net
result of the operation of the variables
p1, p2, p3, and po. Our task will be to
express this dependence as a pair of
functional relationships. This can be
done as follows.

Consider a selected moment during
a positive trial. At this moment S
will be exposed to either S1, with a
probability of po, or to S3, with a
probability of (1 - po). If S is exposed
to S1 he will make the effective
response with a probability of p1.
If the effective response and Ro are
independent of each other the probability
that both Ro and the effective
response will occur will be the product
po p1. If S is exposed to S3 he will
make the effective response with a
probability of p3, and the probability
that both will occur will be the product
(1 - po)p3. The total probability
that the effective response will
occur at this moment will be the sum
of these products. Thus:

$$p_+ = p_o p_1 + (1 - p_o)\,p_3. \qquad (1)$$

By  exactly  parallel  reasoning  with 
respect  to  a  selected  moment  during 
a  negative  trial  we  obtain : 

$$p_- = p_o p_2 + (1 - p_o)\,p_3. \qquad (2)$$
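
Equations 1 and 2 can be evaluated directly. The sketch below is only an illustration: the function name and the numerical values are hypothetical, chosen to represent a well-learned discrimination (p1 high, p2 low) with an intermediate response tendency to the neutral population.

```python
def net_probabilities(p_o, p1, p2, p3):
    """Equations 1 and 2: net probability of the effective response on
    positive and negative trials."""
    p_plus = p_o * p1 + (1.0 - p_o) * p3
    p_minus = p_o * p2 + (1.0 - p_o) * p3
    return p_plus, p_minus


# Hypothetical values for p1, p2, p3; p_o is varied from 0 to 1.
p1, p2, p3 = 0.9, 0.1, 0.5
for p_o in (0.0, 0.25, 0.5, 1.0):
    p_plus, p_minus = net_probabilities(p_o, p1, p2, p3)
    print(p_o, round(p_plus, 3), round(p_minus, 3), round(p_plus - p_minus, 3))
```

Subtracting equation 2 from equation 1 gives p+ - p- = po(p1 - p2), so the observable difference between positive and negative trials grows in proportion to po. This is the quantitative counterpart of the earlier statement that generalization decreases as po increases.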

The next step will be to derive expressions
for predicting the values of
p1, p2, and p3. The reinforcement
contingencies for the effective response
in the presence of S1, S2, and S3 can
be readily ascertained. It will be
possible to predict changes in the
values of p1, p2, and p3 on the basis of
learning functions for the simpler
processes of conditioning and extinction
if we assume that learning with
respect to each of these stimuli proceeds
independently of learning with
respect to the others. This assumption
implies that interaction between
stimuli will have a negligible effect.
However, in making this assumption
we do not forfeit the ability to handle
stimulus generalization within the
present framework, since, as we have
already pointed out, stimulus generalization
can be accounted for without
postulating any such direct interaction.

In  the  present  paper  we  do  not 
adopt  a  particular  set  of  functions  for 
conditioning  and  extinction,  but  at- 
tempt to  set  down  the  relationships 
in  such  a  way  that  any  acceptable  set 
of  functions  can  be  immediately  in- 
serted. 

The assumption of "negligible direct
interaction" implies that changes
in the probability of occurrence of
the effective response with respect to
a particular stimulus population S_i
(i = 1, 2, or 3) will occur only during
the time in which S is exposed to S_i,
and that the rate of change with
respect to time will depend on:

1. Whether or not the effective
response is reinforced.

2. The value of p_i at the time.

If we let r_i represent the proportion
of the time during which S is exposed
to S_i, the rate of change of p_i can be
approximated by two functions as
follows:

$$dp_i/dt = r_i f_c(p_i) \qquad (3)$$

if  the  effective  response  is  reinforced, 
and 

$$dp_i/dt = r_i f_e(p_i) \qquad (4)$$

if  the  effective  response  is  not  rein- 
forced. 

The functions f_c and f_e represent
any acceptable set of analytic functions
which approximate the rate of
change of probability of occurrence
of an effective response during conditioning
and extinction, respectively.
It will be noted that if we assign a
value of 1 to r we will obtain expressions
for simple cases of conditioning
or extinction. In the present model
the values of r_i can be expressed as
functions of po as follows. The positive
and negative stimuli are each to
be present 50 per cent of the time.
During this time the subject will be
exposed to S1 or S2 with a probability
of po. Hence:

$$r_1 = r_2 = .5 p_o.$$

S will be exposed to S3 with a probability
of (1 - po). Hence:

$$r_3 = (1 - p_o).$$

We also know that all effective responses in the presence of S_1
are reinforced, effective responses in the presence of S_2 are not
reinforced, and effective responses in the presence of S_3 are
reinforced an average of one-half of the time. Using the above
values of r and appropriate functions for reinforced and
non-reinforced responses we obtain:

dp_1/dt = .5 p_0 f_c(p_1)    (5)

dp_2/dt = .5 p_0 f_e(p_2)    (6)

dp_3/dt = .5(1 - p_0) f_c(p_3) + .5(1 - p_0) f_e(p_3).    (7)

We can now outline the steps which would be necessary to predict
p_+ and p_- (measurable aspects of effective responses) if we can
predict the values of p_0 as a function of time. Such a function
could be derived empirically or through some theoretical statement
regarding the factors which bring about changes in p_0. If p_0 can
be expressed as a function of time we can rewrite equations 5, 6,
and 7 to obtain expressions involving only dp_i, dt, p_i, and t. If
these differential equations can be solved we will obtain
p_i = f_i(t). Thus we can obtain values of p_1, p_2, p_3, and p_0
for any point in time. These values can be substituted in
equations 1 and 2 to give the desired prediction of p_+ and p_-.
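The prediction steps just outlined can be sketched numerically. The following is only an illustration under stated assumptions: the paper adopts no particular conditioning or extinction functions, so the linear forms f_c(p) = K(1 - p) and f_e(p) = -Kp, the constant K, the curve p0_of_t, and the starting values are all hypothetical choices made here, and forward Euler is used only because it is the simplest integrator.

```python
import math

# Assumed rate functions: the paper leaves f_c and f_e open, so these
# linear forms and the constant K are illustrative only.
K = 0.2

def f_c(p):
    """Hypothetical conditioning rate: p grows toward 1."""
    return K * (1.0 - p)

def f_e(p):
    """Hypothetical extinction rate: p decays toward 0."""
    return -K * p

def p0_of_t(t):
    """Hypothetical course of the observing-response probability p_0."""
    return 0.2 + 0.7 * (1.0 - math.exp(-0.05 * t))

def predict(t_end=100.0, dt=0.05, start=(0.5, 0.5, 0.5)):
    """Integrate equations 5-7 by forward Euler, then apply equations 1 and 2."""
    p1, p2, p3 = start
    steps = int(round(t_end / dt))
    for n in range(steps):
        p0 = p0_of_t(n * dt)
        dp1 = 0.5 * p0 * f_c(p1)                        # equation 5
        dp2 = 0.5 * p0 * f_e(p2)                        # equation 6
        dp3 = 0.5 * (1.0 - p0) * (f_c(p3) + f_e(p3))    # equation 7
        p1, p2, p3 = p1 + dt * dp1, p2 + dt * dp2, p3 + dt * dp3
    p0 = p0_of_t(steps * dt)
    return {
        "p0": p0, "p1": p1, "p2": p2, "p3": p3,
        "p_plus": p0 * p1 + (1.0 - p0) * p3,    # equation 1
        "p_minus": p0 * p2 + (1.0 - p0) * p3,   # equation 2
    }

if __name__ == "__main__":
    for key, value in predict().items():
        print(f"{key}: {value:.3f}")
```

With these assumed functions p_3 remains at its starting value of .5, since conditioning and extinction cancel under 50 per cent reinforcement, while p_1 rises and p_2 falls, so that p_+ and p_- separate as p_0 grows.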

On the other hand if we wish to estimate the values of p_0 from
known values of p_+ and p_-, we can proceed as follows.

Equations 1 and 2 state:

p_+ = p_0 p_1 + (1 - p_0) p_3,    (1)

p_- = p_0 p_2 + (1 - p_0) p_3.    (2)

Differentiating with respect to time we obtain:

dp_+/dt = p_0(dp_1/dt) + p_1(dp_0/dt)
          + (1 - p_0)(dp_3/dt) - p_3(dp_0/dt),    (8)

dp_-/dt = p_0(dp_2/dt) + p_2(dp_0/dt)
          + (1 - p_0)(dp_3/dt) - p_3(dp_0/dt).    (9)

Substituting values for dp_1/dt, dp_2/dt, and dp_3/dt from
equations 5, 6, and 7 and rearranging terms we obtain:

dp_+/dt = .5 p_0^2 f_c(p_1)
          + .5(1 - p_0)^2 [f_c(p_3) + f_e(p_3)]
          + (p_1 - p_3)(dp_0/dt),    (10)

dp_-/dt = .5 p_0^2 f_e(p_2)
          + .5(1 - p_0)^2 [f_c(p_3) + f_e(p_3)]
          + (p_2 - p_3)(dp_0/dt).    (11)
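As a check on the substitution, the step from equation 8 to equation 10 can be written out term by term. The squared factors arise because the coefficient p_0 (or 1 - p_0) in equation 8 multiplies a rate from equation 5 (or 7) that already contains the same factor; equation 11 follows from equations 9, 6, and 7 in the same way.

```latex
\begin{align*}
\frac{dp_+}{dt}
  &= p_0\bigl[.5\,p_0 f_c(p_1)\bigr]
   + (1-p_0)\bigl[.5(1-p_0) f_c(p_3) + .5(1-p_0) f_e(p_3)\bigr]
   + (p_1 - p_3)\frac{dp_0}{dt} \\
  &= .5\,p_0^{2} f_c(p_1)
   + .5(1-p_0)^{2}\bigl[f_c(p_3) + f_e(p_3)\bigr]
   + (p_1 - p_3)\frac{dp_0}{dt}.
\end{align*}
```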

Equations 1, 2, 10, and 11 represent four simultaneous equations.
By combining these equations we can express p_1, p_2, and p_3 as
functions of the other variables and obtain a single expression:

dp_0/dt = G(p_+, p_-, dp_+/dt, dp_-/dt, p_0),    (12)

where the function G will depend on the functions f_c and f_e
adopted for the conditioning and extinction functions. Now, if the
curves representing the values of p_+ and p_- are determined
experimentally, we can express these variables as analytic
functions of time. We can also obtain expressions for dp_+/dt and
dp_-/dt as functions of time. Substituting the functions for p_+,
p_-, dp_+/dt, and dp_-/dt in equation 12 we obtain:

dp_0/dt = G'(t, p_0).    (13)


If this differential equation can be solved we obtain:

p_0 = f_0(t).    (14)

This equation will give us the desired value of p_0 for any point
in time during the experiment.
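To make the form of G concrete, the sketch below works one special case. It assumes, purely for illustration, the linear rate functions f_c(p) = k(1 - p) and f_e(p) = -kp with a shared constant k; the paper does not adopt these. Under that assumption, subtracting equation 11 from equation 10 and using p_+ - p_- = p_0(p_1 - p_2) from equations 1 and 2 removes p_1, p_2, and p_3, leaving dp_0/dt in closed form. The function names are hypothetical.

```python
def f_c(p, k):
    """Hypothetical conditioning rate (an assumption, not the paper's)."""
    return k * (1.0 - p)

def f_e(p, k):
    """Hypothetical extinction rate with the same constant k."""
    return -k * p

def observables(p0, p1, p2, p3, dp0, k):
    """Equations 1, 2, 10, and 11: map a latent state to p_+, p_- and their rates."""
    shared = 0.5 * (1.0 - p0) ** 2 * (f_c(p3, k) + f_e(p3, k))
    p_plus = p0 * p1 + (1.0 - p0) * p3
    p_minus = p0 * p2 + (1.0 - p0) * p3
    dp_plus = 0.5 * p0 ** 2 * f_c(p1, k) + shared + (p1 - p3) * dp0
    dp_minus = 0.5 * p0 ** 2 * f_e(p2, k) + shared + (p2 - p3) * dp0
    return p_plus, p_minus, dp_plus, dp_minus

def G(p_plus, p_minus, dp_plus, dp_minus, p0, k):
    """Equation 12 for the assumed linear rates: returns dp_0/dt."""
    delta = p_plus - p_minus
    if delta == 0.0:
        raise ValueError("p_+ equals p_-: dp_0/dt is not identified")
    return p0 * ((dp_plus - dp_minus) - 0.5 * k * p0 * (p0 - delta)) / delta
```

Feeding the output of observables back into G recovers the dp_0/dt that was put in, which is the consistency equation 12 requires. Integrating G along experimentally fitted curves for p_+ and p_-, from an assumed starting value of p_0, would then give the function of equation 14 for this special case.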

Summary 

In many discrimination learning situations some response, such as
an orienting response, will be required of S before he is exposed
to the discriminative stimuli. We call these responses "observing
responses" (R_0), and indicate their probability of occurrence as
p_0. Increases in p_0 will result in increased exposure to the
discriminative stimuli, and hence increased opportunity for S to
learn or manifest discrimination. Decreased p_0 will have the
opposite effect. These results are operationally equivalent to
decreases or increases, respectively, in stimulus generalization
between the discriminative stimuli. The following general
hypothesis regarding changes in p_0 can be derived from the
principle of secondary reinforcement.

Hypothesis:  Exposure  to  discrimi- 
native stimuli  will  have  a  reinforcing 
effect  on  the  observing  response  to  the 
extent  that  S  has  learned  to  respond 
differently  to  the  two  discriminative 
stimuli. 

From  this  general  hypothesis  we 
derive  the  following  specific  hypothe- 
ses: 

1. p_0 will increase (or remain high) under conditions of
differential reinforcement (discrimination training);

2. p_0 will decrease (or remain low) under conditions of
nondifferential reinforcement;

3. When a well established discrimination is reversed, p_0 will
decrease temporarily and then recover;

4. If the degree of discrimination and p_0 are both low, the
formation of discrimination will be retarded for some interval but
will finally occur quite rapidly.

Evidence in support of these specific hypotheses was obtained in
an experiment in which an R_0 was measured directly.

This formulation may be useful for interpreting behavior in cases
where changes in generalization between stimuli occur, and where
the ease of formation of discrimination on the basis of some
particular set of stimuli changes as a function of training.
Ss learn reversed discriminations more and more rapidly if
reversals are presented repeatedly. The present formulation offers
a relatively simple and readily testable interpretation of this
phenomenon.

This formulation lends itself to precise quantitative statement.
A quantitative analysis could be used in two ways: (1) to make
quantitative predictions of behavior based on some set of
theoretical statements regarding the component learning processes,
and (2) to evaluate p_0 from observations of measurable aspects of
effective responses. The steps required for such an analysis are
outlined.